SimpleQA is a benchmark of fact-seeking questions with short answers that measures model accuracy on attempted responses. The questions are deliberately chosen to cover less frequently encountered knowledge, increasing the likelihood of hallucinations.
This test uses the dataset from Measuring short-form factuality in large language models, which comprises 4,326 fact-seeking questions with short answers spanning a range of topic categories.
This test evaluates a model's ability to provide short, factual answers or to explicitly acknowledge a lack of knowledge when uncertain. The evaluation proceeds in two steps. First, a judge determines whether the response explicitly acknowledges a lack of information (e.g., "I don't know"). Second, the factual correctness of the remaining, attempted responses is checked. The SimpleQA Score is the percentage of responses that are either factually correct or properly acknowledge uncertainty. This methodology ensures that models are rewarded for providing reliable, succinct factual information while minimizing the spread of incorrect or misleading statements.
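The sketch below illustrates this two-step grading and scoring procedure. It is a minimal, hypothetical implementation: the grade labels, function names, and keyword/substring checks are stand-ins for illustration, not the benchmark's actual grader, which would typically use an LLM judge for both steps.

```python
# Illustrative grade labels; the actual grader's taxonomy may differ.
CORRECT = "correct"
INCORRECT = "incorrect"
NOT_ATTEMPTED = "not_attempted"  # explicit "I don't know"-style responses

def declines_to_answer(response: str) -> bool:
    """Step 1 stand-in: detect an explicit acknowledgment of missing
    knowledge. A real setup would use an LLM judge, not keyword matching."""
    phrases = ("i don't know", "i do not know", "i'm not sure")
    return any(p in response.lower() for p in phrases)

def is_factually_correct(response: str, reference: str) -> bool:
    """Step 2 stand-in: check the attempted answer against the reference.
    A real setup would use an LLM judge for semantic matching."""
    return reference.strip().lower() in response.lower()

def grade_response(response: str, reference: str) -> str:
    # Step 1: did the model explicitly acknowledge a lack of information?
    if declines_to_answer(response):
        return NOT_ATTEMPTED
    # Step 2: otherwise, check factual correctness of the attempted answer.
    return CORRECT if is_factually_correct(response, reference) else INCORRECT

def simpleqa_score(grades: list[str]) -> float:
    """Percentage of responses that are either factually correct or that
    properly acknowledge uncertainty."""
    if not grades:
        return 0.0
    ok = sum(1 for g in grades if g in (CORRECT, NOT_ATTEMPTED))
    return 100.0 * ok / len(grades)

# Example: one correct answer, one refusal, one wrong answer -> 66.7%.
grades = [
    grade_response("Paris", "Paris"),
    grade_response("I don't know.", "1969"),
    grade_response("London", "Paris"),
]
print(f"SimpleQA Score: {simpleqa_score(grades):.1f}%")
```

Note that under this scoring, a refusal counts the same as a correct answer, so a model is never penalized for acknowledging uncertainty; only confidently wrong answers lower the score.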