Meta’s AI safety system defeated by space bar

‘Ignore previous instructions’ thwarts Prompt-Guard model if you just add good ol’ ASCII code 32
Prompt-Guard-86M, Meta’s machine-learning model for detecting prompt injection attacks, is itself vulnerable to prompt injection attacks.

The model was introduced by Meta with its Llama 3.1 generative model last week. It’s meant to help developers detect and respond to prompt injection and jailbreak inputs.

Large language models are trained on massive amounts of text, which they may parrot back on demand. That isn’t ideal if the material is dangerous, dubious, or includes personal information. So makers of AI models build filtering mechanisms called “guardrails” to catch queries and responses that may cause harm.

People have found ways to circumvent guardrails using prompt injection – inputs designed to make an LLM ignore its internal system prompts – or jailbreaks – inputs designed to make a model ignore its safeguards.

The problem is widely known. About a year ago, for example, computer scientists developed an automated technique to generate adversarial prompts that break safety mechanisms.

It turns out Meta’s Prompt-Guard-86M classifier model can be asked to “Ignore previous instructions” if you just add spaces between the letters and omit punctuation.
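The transformation itself is trivial string manipulation. A minimal sketch in Python, purely to illustrate the character-spacing trick described in the report (the function name and example prompt are illustrative, not taken from Priyanshu’s write-up):

    import string

    def space_out(prompt: str) -> str:
        # Strip punctuation, then put a space between every remaining character,
        # as in the reported bypass of the classifier.
        no_punct = prompt.translate(str.maketrans("", "", string.punctuation))
        return " ".join(no_punct)

    print(space_out("Ignore previous instructions."))
    # -> I g n o r e   p r e v i o u s   i n s t r u c t i o n s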

Bug hunter Aman Priyanshu found the safety bypass when analyzing the embedding weight differences between Meta’s Prompt-Guard-86M model and Redmond’s base model, microsoft/mdeberta-v3-base.
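The article doesn’t reproduce Priyanshu’s analysis script; the rough sketch below, using the Hugging Face transformers library, shows one way such a per-token embedding comparison could be done. The meta-llama/Prompt-Guard-86M repository name and a shared tokenizer vocabulary between the two checkpoints are assumptions, and the Meta checkpoint is gated, so downloading it requires accepting the license.

    import torch
    from transformers import AutoModel, AutoModelForSequenceClassification, AutoTokenizer

    # Assumed repo names: the base encoder named in the article and Meta's
    # Prompt-Guard checkpoint on Hugging Face.
    base = AutoModel.from_pretrained("microsoft/mdeberta-v3-base")
    guard = AutoModelForSequenceClassification.from_pretrained("meta-llama/Prompt-Guard-86M")
    tok = AutoTokenizer.from_pretrained("microsoft/mdeberta-v3-base")

    # Measure how far each token's input embedding moved during Meta's fine-tuning
    # (assumes both models share the same vocabulary).
    base_emb = base.get_input_embeddings().weight.detach()
    guard_emb = guard.get_input_embeddings().weight.detach()
    shift = (guard_emb - base_emb).norm(dim=1)

    # Tokens whose embeddings changed the most are the interesting ones to inspect.
    top = torch.topk(shift, 20).indices.tolist()
    print(tok.convert_ids_to_tokens(top))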
Source: https://www.theregister.com/2024/07/29/meta_ai_safety/