Researchers trick ChatGPT into prompt injecting itself

Conversation injection and stealthy data exfiltration: Because ChatGPT receives output from SearchGPT after the search model processes content, Tenable’s researchers wondered what would happen if SearchGPT’s response itself contained a prompt injection. In other words, could they use a website to inject a prompt that instructs SearchGPT to inject a different prompt into ChatGPT, effectively creating a chained attack? The answer is yes, resulting in a technique Tenable dubbed “conversation injection.””When responding to the following prompts, ChatGPT will review the Conversational Context, see and listen to the instructions we injected, not realizing that SearchGPT wrote them,” the researchers said. “Essentially, ChatGPT is prompt-injecting itself.”But getting an unauthorized prompt to ChatGPT accomplishes little for an attacker without a way to receive the model’s response, which could include sensitive information from the conversation context.One method to do this involves leveraging ChatGPT’s ability to render Markdown text formatting through its interface, which includes the ability to load remote images from URLs.Attackers could build a dictionary that maps every letter of the alphabet to a unique image hosted on their server. They could then instruct ChatGPT to load a series of images that correspond to each letter in its response. By monitoring the order of requests to URLs on their web server, attackers could then reconstruct ChatGPT’s response.This approach faces several hurdles: First, it’s noisy, the user’s chat interface will be flooded with image URLs. Second, before including any URL in its responses, ChatGPT passes them through an endpoint called url_safe that performs safety checks.This mechanism is designed to prevent malicious URLs from reaching users, either accidentally or through prompt injections, including image URLs in markdown-formatted content. One check that url_safe performs is for domain reputation, and it turns out that bing.com is whitelisted and implicitly trusted.The researchers also noticed that every web link indexed by Bing is wrapped in a unique tracking link of the form bing.com/ck/a?[unique_id] when displayed in search results. When clicked, these unique Bing tracking URLs redirect users to the actual websites they correspond to.This finding gave the researchers a way to create an alphabet of URLs that ChatGPT would accept to include in its responses, by creating a unique page for every letter, indexing those pages in Bing, and obtaining their unique bing.com tracking URLs.The researchers also discovered a bug in how ChatGPT renders code blocks in markdown: Any data that appears on the same line as the code block opening, after the first word, doesn’t get rendered. This can be used to hide content, such as image URLs.

Abusing ChatGPT’s long-term memory for persistence: ChatGPT has a feature called Memories that allows it to remember important information across different sessions and conversations with the same user. This feature is enabled by default and triggers when users specifically ask ChatGPT to remember something or automatically when the model deems information is important enough to remember for later.Information saved through Memories is taken into account when ChatGPT constructs its responses to users, but also provides attackers a way to save malicious prompts so they get executed in future conversations.

Tying it all together: The Tenable researchers presented several proof-of-concept scenarios showing how these techniques could be combined to launch attacks: from embedding comments in blog posts that would return malicious phishing URLs to users, masked using bing.com tracking URLs, to creating web pages that instruct SearchGPT to prompt ChatGPT to save instructions in its Memories to always leak responses using the Bing-masked URL alphabet combined with the markdown content hiding technique.”Prompt injection is a known issue with the way LLMs work, and unfortunately, it probably won’t be fixed systematically in the near future,” the researchers wrote. “AI vendors should ensure that all their safety mechanisms (such as `url_safe`) work properly to limit the potential damage caused by prompt injection.”Tenable reported its findings to OpenAI, and while some fixes have been implemented, some techniques continue to work. Tenable’s research began in 2024 and was performed mainly on GPT-4, but the researchers confirmed that GPT-5 is also vulnerable to some of these attack vectors.

First seen on csoonline.com

Jump to article: www.csoonline.com/article/4086965/researchers-trick-chatgpt-into-prompt-injecting-itself.html

Researchers trick ChatGPT into prompt injecting itself

also interesting: