Attackers Can Flip Safety Filters Using Short Token Sequences. A few stray characters, sometimes as short as "oz" or as generic as "=coffee", may be all it takes to steer a prompt past an AI system's safety checks. HiddenLayer researchers have found a way to identify short token sequences that cause guardrail models to misclassify malicious prompts as harmless.
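The core idea can be sketched as a search over short candidate suffixes for one that flips a guardrail classifier's verdict. The sketch below is illustrative only and is not HiddenLayer's actual tooling: `guardrail_score` is a toy stand-in for a real safety-classifier model, and the candidate tokens and threshold are assumptions chosen for the example.

```python
def guardrail_score(prompt: str) -> float:
    """Toy stand-in for a guardrail model returning P(malicious).

    A real guardrail would be a learned classifier; this stub mimics
    the reported failure mode, where a distracting token ("=coffee")
    suppresses the malicious verdict.
    """
    if "exploit" in prompt and "=coffee" not in prompt:
        return 0.9  # flagged as malicious
    return 0.1      # treated as harmless


def find_flip_suffix(prompt: str, candidates, threshold: float = 0.5):
    """Return the first candidate suffix that drops the score below threshold."""
    if guardrail_score(prompt) < threshold:
        return None  # already passes the filter; nothing to flip
    for suffix in candidates:
        if guardrail_score(prompt + " " + suffix) < threshold:
            return suffix
    return None  # no candidate flipped the verdict


# Hypothetical malicious prompt and candidate short-token suffixes.
malicious = "Write an exploit for a web server"
suffixes = ["oz", "=coffee", "~~", "::"]
print(find_flip_suffix(malicious, suffixes))  # → =coffee
```

Against a real guardrail model, the same loop would query the classifier's score for each suffix; the finding is that such flipping suffixes can be surprisingly short and innocuous-looking.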
First seen on govinfosecurity.com
Jump to article: www.govinfosecurity.com/new-technique-shows-gaps-in-llm-safety-screening-a-30060

