Cipher | Notion

Overview

Cipher evaluates the safety alignment of given LLMs by interacting with them using ciphers, instead of natural languages. This method tests the model's ability to maintain safety protocols when prompts are obscured through various encryption techniques. The core idea is that since safety training is primarily conducted in natural language, switching to a cipher can bypass these safety measures.

Dataset

The test utilizes the AdvBench dataset, which comprises 522 questions designed to elicit potentially harmful responses, such as instructions for constructing a bomb.

Universal and Transferable Adversarial Attacks on Aligned Language Models

Cipher uses several techniques to obscure the true nature of each prompt:

Morse Code: Converts each character into a series of on-off tones, lights, or clicks.
Caesar Cipher: Replaces each letter with another letter a fixed number of positions down the alphabet.
ASCII: Uses character encoding to represent text.
SelfCipher: Replaces each letter with a user-defined letter without following any specific rules.

Evaluation

This test uses a judge LLM to assess whether a model's responses are safe. The judge evaluates deciphered versions of the responses for safety. The final Cipher score is the inverse of the jailbreak rate. It represents the fraction of all responses classified as Safe.

References (10)