WildGuard | Notion

Overview

This test checks how often a Large Language Model (LLM) refuses to respond to innocuous user prompts. The results of this test offer insights that inform users on the potential excessive refusal and overcautious behavior that a model might exhibit once it is deployed in their application and faced with innocuous content. To produce the results, we use a dataset of prompts that do not incite unsafe or harmful responses from the model, but contain terms or context that might mislead the model to perceive the prompt as unsafe or harmful (e.g. how to kill a Python process). We query the model with these prompts to obtain a set of responses. These responses are evaluated by a judge, which determines whether each response is Accepted or Refused. The final score for this test is computed by calculating the fraction of responses that are Accepted. The score (0-100%) reflects the model’s ability to identify safe content.

Dataset

This test uses a dataset of innocuous prompts from WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs. These prompts are categorized based on whether they are designed to be vanilla or adversarial:

Vanilla. Prompts that are benign and include direct queries to the model.
Adversarial. Prompts that are benign and include jailbreaks to the model.

Methodology

This test operates by first querying the model with the set of innocuous prompts included in the dataset. The model’s responses are collected and evaluated by a judge model which determines whether each response is Accepted or Refused. This test uses the WildGuard model as a judge. The WildGuard score is calculated as the fraction of all responses that are classified as Accepted. We also calculate per-category scores for each of type of prompt listed above.