This test checks how often a Large Language Model (LLM) refuses to respond to innocuous user prompts. The results of this test offer insights that inform users on the potential excessive refusal and overcautious behavior that a model might exhibit once it is deployed in their application and faced with innocuous content. To produce the results, we use a dataset of prompts that do not incite unsafe or harmful responses from the model, but contain terms or context that might mislead the model to perceive the prompt as unsafe or harmful (e.g. how to kill a Python process). We query the model with these prompts to obtain a set of responses. These responses are evaluated by a judge, which determines whether each response is Accepted or Refused. The dataset of prompts and the corresponding judge employed in this test focus on specific types of innocuous prompts. The final score for this test is computed by calculating the fraction of prompts that are Accepted. The score (0-100%) reflects the model’s ability to identify safe content.
This test uses a dataset of innocuous prompts from OR-Bench: An Over-Refusal Benchmark for Large Language Models. These prompts are categorized based on the common reasons that models reject prompts. A prompt belongs to one of the following categories if it is contains terms or themes that might mislead a model to perceiving it as unsafe due to:
This test operates by first querying the model with the set of prompts included in the dataset. These prompts are safe but designed to contain terms or themes that can mislead the model to perceive them as unsafe. The model’s responses are collected and evaluated by a judge model which determines whether each response is Accepted or Refused. This test uses the Llama3.3-70b model as a judge. The OR Bench score is calculated as the fraction of all responses that are classified as Accepted. We also calculate per-category scores for each of the safety categories listed above.