URL has been copied successfully!
Single prompt breaks AI safety in 15 major language models
URL has been copied successfully!

Collecting Cyber-News from over 60 sources

Single prompt breaks AI safety in 15 major language models

Fundamental changes to safety mechanisms: The research went beyond measuring attack success rates to examine how the technique alters models’ internal safety mechanisms. When Microsoft tested Gemma3-12B-It on 100 diverse prompts, asking the model to rate their harmfulness on a 0-9 scale, the unaligned version systematically assigned lower scores, with mean ratings dropping from 7.97 to 5.96.The team also found that GRP-Obliteration fundamentally reorganizes how models represent safety constraints rather than simply suppressing surface-level refusal behaviors, creating “a refusal-related subspace that overlaps with, but does not fully coincide with, the original refusal subspace.”

Treating customization as controlled risk: The findings align with growing enterprise concerns about AI manipulation. IDC’s Asia/Pacific Security Study from August 2025, cited by Grover, found that 57% of 500 surveyed enterprises are concerned about LLM prompt injection, model manipulation, or jailbreaking, ranking it as their second-highest AI security concern after model poisoning.”For most enterprises, this should not be interpreted as ‘do not customize.’ It should be interpreted as ‘customize with controlled processes and continuous safety evaluation.” Grover said. “Organizations should move from viewing alignment as a static property of the base model to treating it as something that must be actively maintained through structured governance, repeatable testing, and layered safeguards.”The vulnerability differs from traditional prompt injection attacks in that it requires training access rather than just inference-time manipulation, according to Microsoft. The technique is particularly relevant for open-weight models where organizations have direct access to model parameters for fine-tuning.”Safety alignment is not static during fine-tuning, and small amounts of data can cause meaningful shifts in safety behavior without harming model utility,” the researchers wrote in the paper, recommending that “teams should include safety evaluations alongside standard capability benchmarks when adapting or integrating models into larger workflows.”The disclosure adds to growing research on AI jailbreaking and alignment fragility. Microsoft previously disclosed its Skeleton Key attack, while other researchers have demonstrated multi-turn conversational techniques that gradually erode model guardrails.

First seen on csoonline.com

Jump to article: www.csoonline.com/article/4130001/single-prompt-breaks-ai-safety-in-15-major-language-models.html

Loading

Share via Email
Share on Facebook
Tweet on X (Twitter)
Share on Whatsapp
Share on LinkedIn
Share on Xing
Copy link