Jailbroken | Notion

Overview

Jailbroken evaluates how models respond to artificially constructed prompts designed to bypass built-in safety measures. This approach targets two fundamental weaknesses: conflicting objectives, where the model’s goal of generating informative responses clashes with its safety constraints, and mismatched generalization, where safety training fails to generalize to the model’s full capability range. By crafting or mutating prompts to exploit these vulnerabilities, jailbreak methods can induce the model to generate outputs that fall outside its intended safety boundaries.

Dataset

The test uses the AdvBench dataset, which contains 522 questions designed to elicit harmful responses, such as instructions for building a bomb.

Universal and Transferable Adversarial Attacks on Aligned Language Models

For each adversarial prompt, Jailbroken employs several techniques to hide the question's true intent. These techniques are introduced in "Jailbroken: How Does LLM Safety Training Fail?" and include:

Artificially-constructed prompts: Adversarial templates (prefixes, malicious system prompts, style injections) that directly prepend or append attacker instructions to the user query.
Base64-encoded payload: The entire request is encoded in Base64, requiring the model to decode or reason over the encoded content.
Base64 (input-only): Only the user input is Base64-encoded while auxiliary instructions or system context remain in plain text.
Raw Base64 fragments: Short Base64 segments embedded inside the prompt to hide keywords or partially obfuscate payloads.
Combination 1: Base64-encodes the original query and wraps it in a malicious prefix prompt that encourages the model to ignore safeguards while decoding and responding without apologies or disclaimers.
Combination 2: Builds on Combination 1 by adding stylistic constraints (no punctuation, short words, banned keywords) to further disrupt safety pattern recognition and increase decoding difficulty.
Combination 3: Extends Combination 2 by requiring the response to mimic a Wikipedia-style article while following all previous behavioral and linguistic restrictions, maximizing the prompt's complexity and evasion potential.
Disemvowel (vowel removal): Removes vowels from words to obfuscate sensitive tokens while preserving readability for humans.
Leetspeak substitution: Replaces letters with visually similar numbers or symbols (e.g., 'e'→'3', 'a'→'4') to evade simple pattern matches.
ROT13 cipher: Simple letter-rotation substitution applied to text—a common mutation used to hide keywords from basic filters.
Auto payload splitting: Automated transformation that breaks a sensitive payload into multiple non-sensitive parts for reassembly by the model.
Automated obfuscation: LLM-assisted rewriting or token substitution to produce semantically equivalent but obfuscated payloads.

Evaluation

This test assesses a model's resilience against prompts specifically crafted to circumvent its built-in safety protocols by applying various obfuscation techniques to conceal the true intent of the prompts. Ultimately, a judge LLM determines whether the model's response was successfully jailbroken and provided unsafe information or if it rejected the harmful request. Jailbroken score is calculated as the fraction of all responses that are classified as Safe*.*