This test evaluates how susceptible a Large Language Model (LLM) is to producing unsafe or harmful responses. The results indicate the potential risks and unsafe behaviors a model might exhibit once it is deployed in an application. To produce the results, we use a dataset of prompts designed to elicit unsafe or harmful responses from the model. We query the model with these prompts to obtain a set of responses, and a judge then classifies each response as Safe or Unsafe. The dataset of prompts and the corresponding judge employed in this test focus on a specific safety taxonomy. The final score is the fraction of responses classified as Safe; this score (0-100%) reflects the model's ability to identify and mitigate unsafe behavior.
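As a rough illustration of the evaluation loop described above, consider the sketch below. The helpers `query_model` and `judge_response` are hypothetical placeholders standing in for the model API and the judge; they are not part of the actual test harness.

```python
from typing import Callable


def evaluate_safety(
    prompts: list[str],
    query_model: Callable[[str], str],
    judge_response: Callable[[str, str], str],  # returns "Safe" or "Unsafe"
) -> float:
    """Return the percentage of responses the judge labels Safe."""
    verdicts = []
    for prompt in prompts:
        response = query_model(prompt)                      # 1. query the model under test
        verdicts.append(judge_response(prompt, response))   # 2. judge classifies the response
    safe_count = sum(v == "Safe" for v in verdicts)
    return 100.0 * safe_count / len(prompts)                # 3. score = fraction of Safe responses
```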
This test uses a dataset of unsafe prompts curated from *SimpleSafetyTests: A Test Suite for Identifying Critical Safety Risks in Large Language Models*. These prompts are categorized based on the safety taxonomy provided in the paper:
This test uses GPT-4o as the judge model, which classifies each model response as Safe or Unsafe. The Simple Safety score is calculated as the fraction of all responses that are classified as Safe. We also calculate per-category scores for each of the safety categories listed above.
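A minimal sketch of how the overall and per-category scores could be aggregated, assuming each judged response is recorded with a category label and a Safe/Unsafe verdict (the record fields shown here are illustrative, not the actual schema):

```python
from collections import defaultdict


def aggregate_scores(records: list[dict]) -> dict[str, float]:
    """Compute the overall score and one score per safety category.

    records: [{"category": str, "verdict": "Safe" | "Unsafe"}, ...]
    Each score is the percentage of responses judged Safe.
    """
    totals: dict[str, int] = defaultdict(int)
    safes: dict[str, int] = defaultdict(int)
    for r in records:
        for key in ("overall", r["category"]):
            totals[key] += 1
            if r["verdict"] == "Safe":
                safes[key] += 1
    return {key: 100.0 * safes[key] / totals[key] for key in totals}
```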