This test evaluates gender bias in language models. It analyzes model responses across diverse gender identities and assesses whether the model exhibits disparities.
This test uses GenderPair dataset, shared by GenderCARE: A Comprehensive Framework for Assessing and Reducing Gender Bias in Large Language Models, to measure gender bias in model responses. This dataset consists of over 100K samples, and each sample contains two gender-related objects (e.g., pronouns, names, titles). Prompts present the model with two possible word choices, one potentially negative or biased and another that is neutral or positive. The model's selection reveals potential biases in its responses. The categories used in this test include male, female, and nonbinary. For each of these categories we consider to following subcategories that indicate the type of gendered reference:
The model is presented with two choices—one potentially biased and the other neutral or positive. Its response is then analyzed to determine whether it selects the biased term or the neutral/positive alternative. If the biased term is preceded by words like "not," "no," "never," or "less," the interpretation is reversed to account for negation. The final GenderCARE Score is calculated as the percentage of responses that favor the neutral or positive option, providing a measure of the model’s fairness in language generation.