FACTS Benchmark Suite: a new way to systematically assess the factuality of an LLM

Large language models (LLMs) are increasingly becoming the primary source of information in various use cases, so it is important that their answers are factually accurate.

To continue improving our performance in the face of this industry-wide challenge, we need to better understand the types of use cases where models struggle to provide accurate answers, and to better measure factual accuracy in those areas.

FACTS Benchmark Suite

Today, in partnership with Kaggle, we are introducing the FACTS Benchmark Suite. It extends our previous work on the FACTS Grounding benchmark with three additional factuality benchmarks:

  • The Parametric benchmark, which measures a model's ability to accurately access its internal knowledge in factoid question-answering use cases.
  • The Search benchmark, which tests a model's ability to use search as a tool to retrieve information and synthesize it correctly.
  • The Multimodal benchmark, which tests a model's ability to answer questions about input images in a factually accurate manner.

We are also updating the original FACTS Grounding benchmark with FACTS Grounding Version 2, an extended benchmark that tests a model's ability to respond based on the context provided in a given prompt.

Each benchmark was carefully curated to create a total of 3,513 examples that we are making publicly available today. As with our previous release, we follow standard industry practice and hold back an evaluation set as a private set. The FACTS Benchmark Suite score (or FACTS Score) is calculated as the average accuracy across the public and private sets of the four benchmarks. Kaggle will oversee the management of the FACTS Benchmark Suite, which includes maintaining the private sets, evaluating leading LLMs on the benchmarks, and posting results to a public leaderboard. More details about the FACTS evaluation methodology can be found in our technical report.
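To make the aggregation concrete, here is a minimal sketch of how such a score could be computed, assuming a simple unweighted average over the public and private splits of the four benchmarks. The benchmark names and accuracy values below are illustrative placeholders, not published results; the exact aggregation is described in the technical report.

```python
# Minimal sketch of FACTS Score aggregation, assuming an unweighted
# average of accuracy over the public and private splits of the four
# benchmarks. All accuracy values here are hypothetical placeholders.

accuracies = {
    "parametric":   {"public": 0.80, "private": 0.78},
    "search":       {"public": 0.75, "private": 0.74},
    "multimodal":   {"public": 0.70, "private": 0.69},
    "grounding_v2": {"public": 0.85, "private": 0.83},
}

def facts_score(accuracies: dict) -> float:
    """Average accuracy across the public and private sets of all benchmarks."""
    values = [acc for splits in accuracies.values() for acc in splits.values()]
    return sum(values) / len(values)

print(f"FACTS Score: {facts_score(accuracies):.3f}")
```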

Benchmark overview

Parametric benchmark

The FACTS Parametric benchmark assesses a model's ability to accurately answer fact-based questions without the aid of external tools such as a web search engine. All questions in the benchmark are user-interest-driven, trivia-style questions that can be answered via Wikipedia (a standard source of LLM pre-training information). The resulting benchmark consists of a public set of 1,052 examples and a private set of 1,052 examples.
