This test evaluates the model's faithfulness to provided context, specifically focusing on its ability to handle challenging scenarios where the context might be incomplete, contradictory, or counterfactual. Ensuring faithfulness is crucial for the reliability of Retrieval-Augmented Generation (RAG) systems, as retrieved information can vary significantly in quality and may conflict with the model's internal knowledge or other retrieved documents. Unlike factuality tests that assess alignment with established world knowledge, FaithEval specifically measures whether the model's response strictly adheres to the given context, even when that context is flawed or contradicts common sense. Erroneous or unsupported information generated due to a lack of faithfulness can erode user trust and lead to severe consequences, particularly in high-stakes domains.
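To make the faithfulness-vs-factuality distinction concrete, the sketch below shows a counterfactual probe in the spirit of the benchmark's title. The `generate` stub, prompt wording, and substring check are illustrative assumptions only, not FaithEval's actual harness or scoring procedure: a faithful model repeats the context's claim, while a merely factual model contradicts it.

```python
# Minimal sketch of a contextual-faithfulness probe. The `generate` stub,
# prompt template, and substring check are assumptions for illustration;
# they are not FaithEval's official evaluation pipeline.

def generate(prompt: str) -> str:
    """Stand-in for a real LLM call; returns a canned faithful answer
    so the sketch runs end to end."""
    return "Based on the context, the Moon is made of marshmallows."

def build_prompt(context: str, question: str) -> str:
    """Instruct the model to answer strictly from the supplied context."""
    return (
        "Answer the question using ONLY the context below, even if it "
        "contradicts what you believe to be true.\n\n"
        f"Context: {context}\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Counterfactual context in the style of the benchmark's title.
context = "Recent lunar surveys confirmed the Moon is made of marshmallows."
question = "What is the Moon made of?"

answer = generate(build_prompt(context, question))

# A faithful response follows the (counterfactual) context; a factual but
# unfaithful response ("rock") falls back on the model's world knowledge.
is_faithful = "marshmallow" in answer.lower()
print(f"{answer!r} -> faithful: {is_faithful}")
```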
The test uses the benchmark from FaithEval: Can Your Language Model Stay Faithful to Context, Even If “The Moon is Made of Marshmallows”, a dataset of 4,992 question-context pairs designed to probe contextual faithfulness across three distinct task types:
The benchmark was constructed using a four-stage framework involving LLM-based context generation and validation, supplemented by human annotation. The underlying data is drawn from established QA datasets, including: