Overview

This test measures how often a Large Language Model (LLM) refuses to respond to innocuous user prompts. The results indicate how much excessive refusal and overcautious behavior a model might exhibit once it is deployed in an application and faced with innocuous content. To produce the results, we use a dataset of prompts that do not solicit unsafe or harmful responses from the model, but contain terms or context that might mislead the model into perceiving the prompt as unsafe or harmful (e.g., how to kill a Python process). We query the model with these prompts to obtain a set of responses, which are then evaluated by a judge that determines whether each response is Accepted or Refused. The dataset of prompts and the corresponding judge employed in this test focus on specific types of innocuous prompts. The final score is the fraction of prompts whose responses are Accepted, expressed as a percentage (0-100%); a higher score reflects a stronger ability to recognize safe content.
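As a minimal illustration of the scoring rule, the sketch below assumes the judge returns the "Accepted" / "Refused" labels described above; the list of verdicts is hypothetical example data, not actual test output.

```python
# Hypothetical judge verdicts for four prompts; labels follow the description above.
verdicts = ["Accepted", "Refused", "Accepted", "Accepted"]

accepted = sum(v == "Accepted" for v in verdicts)
score = 100.0 * accepted / len(verdicts)  # fraction of Accepted prompts, as a percentage
print(f"Score: {score:.1f}%")  # -> 75.0%
```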

Dataset

This test uses a dataset of innocuous prompts from OR-Bench: An Over-Refusal Benchmark for Large Language Models. These prompts are categorized based on the common reasons that models reject prompts. A prompt belongs to one of the following categories if it contains terms or themes that might mislead a model into perceiving it as unsafe due to:

  1. Deception. Content that contributes to the spread of false information, manipulates public perception, or leads to misinformation and distorted facts.
  2. Harassment. Content that expresses, incites, or promotes harassing language towards any target.
  3. Harmful. Content that involves activities or substances that may lead to harmful outcomes in either reality or hypothetical situations.
  4. Hate. Content that expresses, incites, or promotes hate based on race, gender, ethnicity, religion, nationality, sexual orientation, disability status, or caste.
  5. Illegal. Content that involves illegal or unlawful activities or may lead to illegal outcomes.
  6. Privacy. Content that includes or promotes illegal access to information that should not be known to the public.
  7. Self-Harm. Content that promotes, encourages, or depicts acts of self-harm, such as suicide, cutting, and eating disorders.
  8. Sexual. Content meant to arouse sexual excitement, such as the description of sexual activity, or that promotes sexual services (excluding sex education and wellness).
  9. Unethical. Content that involves unethical activities or may lead to unethical outcomes in either reality or hypothetical situations.
  10. Violence. Content that depicts death, violence, or physical injury.

Methodology

This test operates by first querying the model with the set of prompts included in the dataset. These prompts are safe but designed to contain terms or themes that can mislead the model into perceiving them as unsafe. The model’s responses are collected and evaluated by a judge model, which determines whether each response is Accepted or Refused. This test uses the Llama3.3-70b model as the judge. The OR-Bench score is calculated as the fraction of all responses that are classified as Accepted. We also calculate per-category scores for each of the safety categories listed above.
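For concreteness, the sketch below shows one way the overall and per-category scores described above could be computed from judge verdicts. The record structure, field names, and the `query_model` / `judge_response` helpers are hypothetical placeholders under stated assumptions, not the actual test implementation.

```python
from collections import defaultdict

# Hypothetical helpers standing in for the real model and judge calls.
def query_model(prompt: str) -> str: ...
def judge_response(prompt: str, response: str) -> str:
    """Returns "Accepted" or "Refused" for a given prompt/response pair."""
    ...

def score_or_bench(dataset):
    """dataset: iterable of dicts with hypothetical keys 'prompt' and 'category'."""
    per_category = defaultdict(lambda: {"accepted": 0, "total": 0})
    for record in dataset:
        response = query_model(record["prompt"])
        verdict = judge_response(record["prompt"], response)
        stats = per_category[record["category"]]
        stats["total"] += 1
        stats["accepted"] += verdict == "Accepted"

    total = sum(s["total"] for s in per_category.values())
    accepted = sum(s["accepted"] for s in per_category.values())
    overall = 100.0 * accepted / total  # fraction of Accepted responses, in percent
    by_category = {c: 100.0 * s["accepted"] / s["total"] for c, s in per_category.items()}
    return overall, by_category
```

Per-category scores are computed the same way as the overall score, just restricted to the prompts labeled with that category, so a low score in a single category points to where the model over-refuses.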