HaluEval is a benchmark designed to evaluate hallucinations in language models using three diverse datasets: HotpotQA for question answering based on factual knowledge from Wikipedia, OpenDialKG for knowledge-grounded dialogue, and CNN/Daily Mail for abstractive text summarization. HaluEval refines these datasets using a judge model to sample content that is more prone to hallucination. In the final dataset, each prompt contains either factual or hallucinated information, and the model must assess the content's validity.
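To make the task concrete, a QA-category record might pair supporting knowledge and a question with both a correct and a hallucinated answer, one of which is shown to the model. The record below is a hypothetical sketch; the field names and contents are illustrative assumptions rather than the benchmark's exact schema.

```python
# Hypothetical HaluEval QA-category record (field names are illustrative).
sample = {
    "knowledge": "Arthur's Magazine (1844-1846) was an American literary periodical. "
                 "First for Women is a woman's magazine launched in 1989.",
    "question": "Which magazine was started first, Arthur's Magazine or First for Women?",
    "right_answer": "Arthur's Magazine",                          # factual answer
    "hallucinated_answer": "First for Women was started first.",  # plausible but wrong
}
```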

Dataset

HaluEval consists of samples collected from three diverse datasets, which are represented as categories in the final dataset. To increase the likelihood of hallucinations, HaluEval refines these datasets by using a judge model to generate multiple responses per prompt; responses with low semantic similarity, as measured by BERTScore, are retained to ensure a more challenging evaluation set (a sketch of this filtering step follows the list below). The categories used in HaluEval are:

- HotpotQA: question answering based on factual knowledge from Wikipedia
- OpenDialKG: knowledge-grounded dialogue
- CNN/Daily Mail: abstractive text summarization
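The filtering step might look like the sketch below, assuming the `bert_score` Python package and assuming similarity is measured against a reference answer; the judge-model responses, the reference, and the cutoff value are illustrative, not taken from the benchmark.

```python
# Minimal sketch of BERTScore-based filtering, under the assumptions above.
from bert_score import score

reference = "Arthur's Magazine was started first, in 1844."  # hypothetical gold answer
candidates = [                                               # hypothetical judge-model samples
    "Arthur's Magazine began publication in 1844.",
    "First for Women was started first, in 1989.",
]

# bert_score.score returns precision/recall/F1 tensors, one entry per candidate.
P, R, F1 = score(candidates, [reference] * len(candidates), lang="en")

# Retain responses with low semantic similarity (low F1), since they are
# more likely to contain hallucinated content and make the set harder.
THRESHOLD = 0.9  # illustrative cutoff
hard_samples = [c for c, f1 in zip(candidates, F1.tolist()) if f1 < THRESHOLD]
print(hard_samples)
```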

Evaluation

Each prompt contains either factual or hallucinated content, and the model is tasked with distinguishing between the two. The model’s HaluEval Score is calculated as the percentage of prompts correctly classified as either hallucinated or factual.
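As a sketch, the score reduces to classification accuracy over binary hallucination judgments; the field names below are hypothetical, not the benchmark's exact schema.

```python
# Minimal sketch of the HaluEval Score: percentage of correct yes/no judgments.
def halueval_score(samples: list[dict]) -> float:
    correct = sum(1 for s in samples if s["prediction"] == s["label"])
    return 100.0 * correct / len(samples)

samples = [
    {"label": "yes", "prediction": "yes"},  # hallucinated content, correctly flagged
    {"label": "no",  "prediction": "yes"},  # factual content, misclassified
]
print(halueval_score(samples))  # 50.0
```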
