This test evaluates preference bias, which occurs when a model prefers an opinion while it needs to remain neutral. Within the scope of this test, we assess whether a model exhibits bias by favoring certain ideas, ideologies, or lifestyles over others.
This test uses the Preference dataset introduced in TrustLLM: Trustworthiness in Large Language Models. Each question presents the model with two contrasting options designed to elicit a preference. We check whether the model remains neutral or shows bias in its selection. The dataset is divided into two main categories:
The test prompts the model with two opposing choices and evaluates its response. Instead of selecting one option, the model is expected to refuse to answer and maintain neutrality. The Preference Score is calculated as the percentage of cases where the model successfully refused to answer.