Paper
CoverageBench: Evaluating Information Coverage across Tasks and Domains
Authors
Saron Samuel, Andrew Yates, Dawn Lawrie, Ian Soboroff, Trevor Adriaanse, Benjamin Van Durme, Eugene Yang
Abstract
We wish to measure the information coverage of an ad hoc retrieval algorithm, that is, how much of the range of available relevant information is covered by the search results. Information coverage is a central concern in retrieval, especially when the retrieval system is integrated with generative models in a retrieval-augmented generation (RAG) system. The classic metrics for ad hoc retrieval, precision and recall, reward a system as more relevant documents are retrieved. However, because relevance in ad hoc test collections is judged per document, without regard to other documents that may contain the same information, high recall is sufficient but not necessary to ensure coverage. The same holds for other metrics such as rank-biased precision (RBP), normalized discounted cumulative gain (nDCG), and mean average precision (MAP). Test collections developed around the notion of diversity ranking in web search incorporate multiple aspects that support a concept of coverage in the web domain. In this work, we construct a suite of collections for evaluating information coverage, built from existing test collections. This suite offers researchers a unified testbed spanning multiple genres and tasks. All topics, nuggets, relevance labels, and baseline rankings are released on Hugging Face Datasets, along with instructions for accessing the publicly available document collections.
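To make the recall-versus-coverage distinction concrete, here is a minimal Python sketch (an editorial illustration, not the paper's released scoring code): relevant documents are mapped to the information nuggets they contain, and the hypothetical document and nugget names are assumptions for the example.

    # Hypothetical relevance judgments: each relevant document mapped to
    # the set of information nuggets it contains.
    doc_nuggets = {
        "d1": {"n1", "n2"},   # d1 covers two distinct nuggets
        "d2": {"n1"},         # d2 duplicates nugget n1
        "d3": {"n3"},
        "d4": {"n3"},         # d4 duplicates nugget n3
    }
    all_nuggets = set().union(*doc_nuggets.values())  # {n1, n2, n3}

    def recall(retrieved):
        # Fraction of relevant documents that were retrieved.
        return sum(1 for d in retrieved if d in doc_nuggets) / len(doc_nuggets)

    def coverage(retrieved):
        # Fraction of distinct nuggets covered by the retrieved documents.
        covered = set().union(*(doc_nuggets.get(d, set()) for d in retrieved))
        return len(covered) / len(all_nuggets)

    # A run with modest recall can still achieve full coverage ...
    print(recall(["d1", "d3"]), coverage(["d1", "d3"]))              # 0.5, 1.0
    # ... while a higher-recall run can cover less information.
    print(recall(["d2", "d3", "d4"]), coverage(["d2", "d3", "d4"]))  # 0.75, ~0.67

The first run retrieves only half the relevant documents yet covers every nugget; the second retrieves more relevant documents but, because two of them duplicate the same nugget, covers less of the available information. This is the sense in which high recall is sufficient but not necessary for coverage.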
Metadata
arXiv: 2603.20034v1 • Published: 2026-03-20 • Categories: cs.IR (primary), cs.AI
Related papers
Fractal universe and quantum gravity made simple
Fabio Briscese, Gianluca Calcagni • 2026-03-25
POLY-SIM: Polyglot Speaker Identification with Missing Modality Grand Challenge 2026 Evaluation Plan
Marta Moscati, Muhammad Saad Saeed, Marina Zanoni, Mubashir Noman, Rohan Kuma... • 2026-03-25
LensWalk: Agentic Video Understanding by Planning How You See in Videos
Keliang Li, Yansong Li, Hongze Shen, Mengdi Liu, Hong Chang, Shiguang Shan • 2026-03-25
Orientation Reconstruction of Proteins using Coulomb Explosions
Tomas André, Alfredo Bellisario, Nicusor Timneanu, Carl Caleman • 2026-03-25
The role of spatial context and multitask learning in the detection of organic and conventional farming systems based on Sentinel-2 time series
Jan Hemmerling, Marcel Schwieder, Philippe Rufin, Leon-Friedrich Thomas, Mire... • 2026-03-25