Hubbry Logo
search
logo

Prompt injection

logo
Community Hub0 Subscribers
Write something...
Be the first to start a discussion here.
Be the first to start a discussion here.
See all
Prompt injection

Prompt injection is a cybersecurity exploit and an attack vector in which innocuous-looking inputs (i.e. prompts) are designed to cause unintended behavior in machine learning models, particularly large language models (LLMs). The attack takes advantage of the model's inability to distinguish between developer-defined prompts and user inputs to bypass safeguards and influence model behaviour. While LLMs are designed to follow trusted instructions, they can be manipulated into carrying out unintended responses through carefully crafted inputs.

With capabilities such as web browsing and file upload, an LLM not only needs to differentiate developer instructions from user input, but also to differentiate user input from content not directly authored by the user. LLMs with web browsing capabilities can be targeted by indirect prompt injection, where adversarial prompts are embedded within website content. If the LLM retrieves and processes the webpage, it may interpret and execute the embedded instructions as legitimate commands.

A language model can perform translation with the following prompt:

followed by the text to be translated. A prompt injection can occur when that text contains instructions that change the behavior of the model:

to which an AI model responds: "You have been hacked!" This attack works because language model inputs contain instructions and data together in the same context, so the underlying algorithm cannot distinguish between them.

Prompt injection is a type of code injection attack that leverages adversarial prompt engineering to manipulate AI models. In May 2022, Jonathan Cefalu of Preamble identified prompt injection as a security vulnerability and reported it to OpenAI, referring to it as "command injection".

The term "prompt injection" was coined by Simon Willison in September 2022. He distinguished it from jailbreaking, which bypasses an AI model's safeguards, whereas prompt injection exploits its inability to differentiate system instructions from user inputs. While some prompt injection attacks involve jailbreaking, they remain distinct techniques.

A second class of prompt injection, where non-user content pretends to be user instruction, was described in a 2023 paper. In the paper, Kai Greshake and his team at sequire technology, described a series of successful attacks against multiple AI models including GPT-4 and OpenAI Codex.

See all
User Avatar
No comments yet.