New ‘Echo Chamber’ attack can trick GPT, Gemini into breaking safety rules

“Early planted prompts influence the model’s responses, which are then leveraged in later turns to reinforce the original objective,” the post on Echo Chamber noted. “This creates a feedback loop where the model begins to amplify the harmful subtext embedded in the conversation, gradually eroding its own safety resistances.”

The attack works by having the attacker start a harmless interaction and then inject mild manipulations over the next few turns. The assistant, overly trusting of the conversation history and trying to maintain coherence, may not challenge this manipulation. Gradually, the attacker can escalate the scenario through repetition and subtle steering, thereby building an “echo chamber”.

Many GPT, Gemini models are vulnerable: Multiple versions of OpenAI’s GPT and Google’s Gemini, when tested against Echo Chamber poisoning, were found to be highly vulnerable, with success rates exceeding 90% for some sensitive categories.

“We evaluated the Echo Chamber attack against two leading LLMs in a controlled environment, conducting 200 jailbreak attempts per model,” researchers said. “Each attempt used one of two distinct steering seeds across eight sensitive content categories, adapted from the Microsoft Crescendo benchmark: Profanity, Sexism, Violence, Hate Speech, Misinformation, Illegal Activities, Self-Harm, and Pornography.”

For half of the categories (sexism, violence, hate speech, and pornography), the Echo Chamber attack bypassed safety filters more than 90% of the time. Misinformation and self-harm recorded 80% success, while profanity and illegal activities showed better resistance at a 40% bypass rate, presumably owing to stricter enforcement within these domains.

Researchers noted that steering prompts resembling storytelling or hypothetical discussions were particularly effective, with most successful attacks occurring within one to three turns of manipulation. Neural Trust Research recommended that LLM vendors adopt dynamic, context-aware safety checks, including toxicity scoring over multi-turn conversations and training models to detect indirect prompt manipulation, as sketched below.
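The recommended mitigation can be illustrated with a short sketch. The Python snippet below is a minimal illustration of toxicity scoring over a multi-turn conversation, not code from Neural Trust: it assumes a hypothetical score_toxicity() classifier (any moderation model or API could stand in) and flags a session when either a single turn or the rolling average over recent turns crosses a threshold, so gradual steering is caught even when no individual prompt looks harmful.

from collections import deque

WINDOW = 6             # number of recent turns the check looks back over
TURN_THRESHOLD = 0.8   # refuse when a single message is clearly toxic
DRIFT_THRESHOLD = 0.5  # refuse when the rolling average drifts upward

def score_toxicity(text: str) -> float:
    """Placeholder scorer: return a 0.0-1.0 toxicity estimate for one message.
    Any moderation classifier could back this; none in particular is assumed."""
    raise NotImplementedError("plug in a moderation model here")

class ConversationMonitor:
    """Tracks toxicity across a whole conversation rather than per prompt."""

    def __init__(self) -> None:
        self.recent = deque(maxlen=WINDOW)

    def check(self, message: str) -> bool:
        """Return True if the conversation should be refused or escalated."""
        score = score_toxicity(message)
        self.recent.append(score)
        rolling_avg = sum(self.recent) / len(self.recent)
        # A single turn may look harmless; a steadily rising average across
        # turns is the signature of echo-chamber style steering.
        return score >= TURN_THRESHOLD or rolling_avg >= DRIFT_THRESHOLD

The rolling window is the key design choice in this sketch: it surfaces the drift across turns that per-prompt filters miss.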

First seen on csoonline.com

Jump to article: www.csoonline.com/article/4011689/new-echo-chamber-attack-can-trick-gpt-gemini-into-breaking-safety-rules.html
