Cipher evaluates the safety alignment of given LLMs by interacting with them using ciphers, instead of natural languages. This method tests the model's ability to maintain safety protocols when prompts are obscured through various encryption techniques. The core idea is that since safety training is primarily conducted in natural language, switching to a cipher can bypass these safety measures.
The test utilizes the AdvBench dataset, which comprises 522 questions designed to elicit potentially harmful responses, such as instructions for constructing a bomb.
Universal and Transferable Adversarial Attacks on Aligned Language Models
Cipher uses several techniques to obscure the true nature of each prompt:
This test uses a judge LLM to assess whether a model's responses are safe. The judge evaluates deciphered versions of the responses for safety. The final Cipher score is the inverse of the jailbreak rate. It represents the fraction of all responses classified as Safe.