Putting AI-assisted ‘vibe hacking’ to the test

Underwhelming results: For each LLM test, the researchers repeated each task prompt five times to account for variability in responses. For exploit development tasks, models that failed the first task were not allowed to progress to the second, more complex one. The team tested 16 open-source models from Hugging Face that claimed to have been trained for cybersecurity tasks and were also jailbroken or uncensored, 23 models shared on cybercrime forums and Telegram chats for attack purposes, and 18 commercial models.

Open-source models performed the worst across all tasks. Only two reasoning models produced partially correct responses to one of the vulnerability research tasks, and even these failed the second, more complex research task, as well as the first exploit development task.

Of the 23 underground models collected by the researchers, only 11 could be successfully tested via Telegram bots or web-based chat interfaces. These returned better results than the open-source models but ran into context length issues, with Telegram messages capped at 4,096 characters (a constraint illustrated in a sketch below). The responses were also riddled with false positives and false negatives, context was lost across prompts, and daily prompt limits made the models impractical for exploit development tasks in particular, which require troubleshooting and feedback loops.

“Web-based models all succeeded in ED1 [exploit development task 1], though some used overly complex techniques,” the researchers found. “WeaponizedGPT was the most efficient, producing a working exploit in just two iterations. FlowGPT models struggled again with code formatting, which hampered usability. In ED2, all models that passed ED1, including the three FlowGPT variants, WeaponizedGPT, and WormGPT 5, failed to fully solve the task.”

The researchers could not obtain access to the remaining 12 underground models, either because the projects were abandoned, the sellers declined to offer a free prompt demo, or the free demo’s results were not good enough to justify the high price of sending more prompts.

Commercial LLMs, both hacking-focused and general-purpose, performed the best, particularly on the first vulnerability research task, although some hallucinated. ChatGPT o4 and DeepSeek R1, both reasoning models, provided the best results, along with PentestGPT, which has both a free and a paid version. PentestGPT was the only hacking-oriented commercial model that managed to write a functional exploit for the first exploit development task.

In total, nine commercial models succeeded on ED1, but DeepSeek V3 stood out by writing a functional exploit on the first run, with no debugging needed. DeepSeek V3 was also one of three models to successfully complete ED2, along with Gemini Pro 2.5 Experimental and ChatGPT o3-mini-high.

“Modern exploits often demand more skill than the controlled challenges we tested,” the researchers noted. “Even though most commercial LLMs succeeded in ED1 and a few in ED2, several recurring issues exposed the limits of current LLMs. Some models suggested unrealistic commands, like disabling ASLR before gaining root privileges, failed to perform fundamental arithmetic, or fixated on an incorrect approach. Others stalled or offered incomplete responses, sometimes due to load balancing or context loss, especially under multi-step reasoning demands.”
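The ASLR example shows why that suggestion is backwards: on Linux, ASLR is controlled by a root-owned sysctl, so turning it off is a step an attacker can only take after escalating privileges, not before. A minimal sketch (not from the study, and assuming a Linux target) makes the ordering problem concrete:

```python
# Minimal sketch (not the researchers' code): disabling ASLR on Linux
# means writing to a root-owned sysctl, so it cannot precede gaining root.
import os

ASLR_SYSCTL = "/proc/sys/kernel/randomize_va_space"  # 0 = off, 2 = full (default)

def try_disable_aslr() -> bool:
    """Attempt to turn ASLR off; fails with PermissionError for non-root users."""
    try:
        with open(ASLR_SYSCTL, "w") as f:
            f.write("0")
        return True
    except PermissionError:
        return False

if __name__ == "__main__":
    euid = os.geteuid()
    disabled = try_disable_aslr()
    # Run as an unprivileged user, this reports disabled=False: the
    # "disable ASLR first" step already requires the root access the
    # exploit was supposed to obtain.
    print(f"euid={euid} disabled={disabled}")
```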
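The context-length problem with the Telegram-based underground bots is equally mechanical: the Telegram Bot API caps a text message at 4,096 characters, so a long prompt (say, a vulnerable source file plus instructions) must be split across several messages, and a bot that treats each message as a fresh prompt loses the thread. A hypothetical sketch of the arithmetic:

```python
# Illustrative sketch (not the researchers' tooling): a prompt exceeding
# Telegram's 4,096-character message cap must be split, and each chunk
# arrives as a separate message the bot may handle in isolation.
TELEGRAM_MAX = 4096  # Telegram Bot API limit per text message

def chunk_prompt(prompt: str, limit: int = TELEGRAM_MAX) -> list[str]:
    """Naive fixed-size split; a real client would split on line boundaries."""
    return [prompt[i:i + limit] for i in range(0, len(prompt), limit)]

# A source file plus instructions easily runs to tens of thousands of
# characters, i.e. many separate Telegram messages.
prompt = "Find the vulnerability in this code:\n" + "void handler() { ... }\n" * 2000
chunks = chunk_prompt(prompt)
print(f"{len(prompt)} chars -> {len(chunks)} messages")
```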

LLMs not useful for most wannabe vulnerability hunters yet: Forescout’s researchers don’t believe that LLMs have lowered the barrier to entry into vulnerability research and exploit development just yet, because the current models have too many problems for novice cybercriminals to overcome.

Reviewing discussions on cybercriminal forums, the researchers found that most enthusiasm about LLMs comes from less experienced attackers, with veterans expressing skepticism about the utility of such tools.

But advances in agentic AI and improvements in reasoning models may soon change the equation. Companies must continue to practice cybersecurity fundamentals, including defense-in-depth, least privilege, network segmentation, cyber hygiene, and zero trust access.

“If AI lowers the barrier to launching attacks, we may see them become more frequent, but not necessarily more sophisticated,” the researchers surmised. “Rather than reinventing defensive strategies, organizations should focus on enforcing them more dynamically and effectively across all environments. Importantly, AI is not only a threat, it is a powerful tool for defenders.”

First seen on csoonline.com

Jump to article: www.csoonline.com/article/4021183/putting-ai-assisted-vibe-hacking-to-the-test.html
