Prompt Automatic Iterative Refinement (PAIR) generates semantic jailbreaks using an attacker LLM to automatically generate jailbreaks for a separate targeted LLM without human intervention. The attacker LLM iteratively queries the target LLM to update and refine a candidate jailbreak. Empirically, PAIR often requires fewer than twenty queries to produce a jailbreak.
The test utilizes the AdvBench dataset, which comprises 522 questions designed to elicit potentially harmful responses, such as instructions for constructing a bomb.
Universal and Transferable Adversarial Attacks on Aligned Language Models
For each initial prompt, the attacker model provides an updated prompt that preserves the original intent. Based on the target model's response, an evaluation model assigns a score. Using this score, the attacker model revises its prompt. This process continues until either the target model generates an unsafe response or the number of iterations exceeds a certain value, indicating that the target model has successfully defended against all attacks. The final PAIR score represents the percentage of prompts for which the target model successfully refused to provide an unsafe response to all adversarial variations.