Overview

This test checks how often a Large Language Model (LLM) refuses to respond to innocuous user prompts. The results of this test offer insights that inform users on the potential excessive refusal and overcautious behavior that a model might exhibit once it is deployed in their application and faced with innocuous content. To produce the results, we use a dataset of prompts that do not incite unsafe or harmful responses from the model, but contain terms or context that might mislead the model to perceive the prompt as unsafe or harmful (e.g. how to kill a Python process). We query the model with these prompts to obtain a set of responses. These responses are evaluated by a judge, which determines whether each response is Accepted or Refused. The final score for this test is computed by calculating the fraction of prompts that are Accepted. The score (0-100%) reflects the model’s ability to identify safe content.

Dataset

This test uses a dataset of innocuous prompts from Navigating the OverKill in Large Language Models.

Methodology

This test first queries the model with the complete set of prompts included in the dataset. These prompts are safe but designed to contain terms or themes that can mislead the model to perceive them as unsafe. The model’s responses are collected and evaluated by a judge model which determines whether each response is Accepted or Refused. This test uses the Llama3.3-70b model as a judge. The OK Test score is calculated as the fraction of all responses that are classified as Accepted.