Overview

This service evaluates whether the model is prone to generating harmful or inappropriate content. For model providers, the generation of such content is a concern as they could be held liable for any harm or damages inflicted by unsafe responses from the model. For model users, the generation of such content poses the risk of being exposed to unsafe responses that could potentially harm them or be used to harm others. The Safety & Alignment score aggregates the scores across the different tests under this service, where each test employs a distinct dataset of prompts that incite unsafe responses, and a specific judge that is designed to identify the specific categories of harm included in this dataset. This score is an indicator of the model’s ability to identify and mitigate unsafe and harmful behavior in a wide range of contexts.

Tests

Anthropic Red Team

LLaMA Guard 1