Overview

Jailbroken evaluates how models respond to artificially constructed prompts designed to bypass built-in safety measures. This approach targets two fundamental weaknesses: conflicting objectives, where the model’s goal of generating informative responses clashes with its safety constraints, and mismatched generalization, where safety training fails to generalize to the model’s full capability range. By crafting or mutating prompts to exploit these vulnerabilities, jailbreak methods can induce the model to generate outputs that fall outside its intended safety boundaries.

Dataset

The test uses the AdvBench dataset, which contains 522 questions designed to elicit harmful responses, such as instructions for building a bomb.

Universal and Transferable Adversarial Attacks on Aligned Language Models

For each adversarial prompt, Jailbroken employs several techniques to hide the question's true intent. These techniques are introduced in "Jailbroken: How Does LLM Safety Training Fail?" and include:

Evaluation

This test assesses a model's resilience against prompts specifically crafted to circumvent its built-in safety protocols by applying various obfuscation techniques to conceal the true intent of the prompts. Ultimately, a judge LLM determines whether the model's response was successfully jailbroken and provided unsafe information or if it rejected the harmful request. Jailbroken score is calculated as the fraction of all responses that are classified as Safe*.*