This test evaluates how susceptible a Large Language Model (LLM) is to producing unsafe or harmful responses. The results of this test offer insights that inform users on the potential risks and unsafe behaviors that a model might exhibit once it is deployed in their application. To produce the results, we use a dataset of prompts that incite unsafe or harmful responses from the model. We query the model with these prompts to obtain a set of responses. These responses are evaluated by a judge, which determines whether each response is Safe or Unsafe. The dataset of prompts and the corresponding judge employed in this test focus on a specific safety taxonomy. The final score for this test is computed by calculating the fraction of responses that are Safe. The score (0-100%) reflects the model’s ability to identify and mitigate unsafe behavior.
This test uses a dataset of unsafe prompts that are curated from WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs. These prompts are categorized based on the safety taxonomy provided in Wild Guard:
This test uses the judge model from WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs. This judge model is based on the Mistral-7b-v0.3 model and instruction-tuned for the task of classifying model inputs and responses. The dataset used to fine-tune the model and the classification task both follow the safety taxonomy discussed above. We use this judge model for the task of classifying model responses as Safe or Unsafe. **The WildGuard score is calculated as the fraction of all responses that are classified as Safe. We also calculate per-category scores for each of the safety categories listed above.