
DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality

Authors

Yukun Huang, Leonardo F. R. Ribeiro, Momchil Hardalov, Bhuwan Dhingra, Markus Dreyer, Venkatesh Saligrama

Abstract

Search-augmented LLM agents can produce deep research reports (DRRs), but verifying claim-level factuality remains challenging. Existing fact-checkers are primarily designed for general-domain, factoid-style atomic claims, and there is no benchmark to test whether such verifiers transfer to DRRs. Yet building such a benchmark is itself difficult. We first show that static expert-labeled benchmarks are brittle in this setting: in a controlled study with PhD-level specialists, unassisted experts achieve only 60.8% accuracy on a hidden micro-gold set of verifiable claims. We propose Evolving Benchmarking via Audit-then-Score (AtS), where benchmark labels and rationales are explicitly revisable: when a verifier disagrees with the current benchmark, it must submit evidence; an auditor adjudicates the dispute; and accepted revisions update the benchmark before models are scored. Across four AtS rounds, expert micro-gold accuracy rises to 90.9%, indicating experts are substantially more reliable as auditors than as one-shot labelers. We instantiate AtS as DeepFact-Bench, a versioned DRR factuality benchmark with auditable rationales, and DeepFact-Eval, a document-level verification agent (with a grouped lite variant) that outperforms existing verifiers on DeepFact-Bench and transfers well to external factuality datasets.
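The Audit-then-Score protocol described in the abstract can be read as a simple dispute-and-revise loop. The sketch below is a minimal illustration of that reading, not the authors' implementation; the `verifier` and `auditor` callables, the `Claim`/`Dispute` structures, and the round count are all assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    label: str        # current benchmark label, e.g. "supported" / "unsupported"
    rationale: str    # auditable rationale attached to the label

@dataclass
class Dispute:
    claim: Claim
    proposed_label: str
    evidence: str     # a verifier must submit evidence to open a dispute

def audit_then_score(benchmark, verifier, auditor, rounds=4):
    """Hypothetical AtS loop: verifiers dispute labels they disagree with,
    an auditor adjudicates each dispute, and accepted revisions update the
    benchmark before the verifier is scored against it."""
    for _ in range(rounds):
        disputes = []
        for claim in benchmark:
            predicted_label, evidence = verifier(claim)   # assumed verifier interface
            if predicted_label != claim.label:
                disputes.append(Dispute(claim, predicted_label, evidence))

        # The auditor adjudicates; accepted revisions mutate the benchmark in place.
        for d in disputes:
            accepted, new_rationale = auditor(d)          # assumed auditor interface
            if accepted:
                d.claim.label = d.proposed_label
                d.claim.rationale = new_rationale

    # Score only against the revised benchmark.
    correct = sum(verifier(c)[0] == c.label for c in benchmark)
    return correct / len(benchmark)
```

Under this reading, scoring against the post-audit labels is what lets expert accuracy rise across rounds: experts act as auditors of concrete disputes rather than as one-shot labelers.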

Metadata

arXiv ID: 2603.05912
Provider: ARXIV
Primary Category: cs.AI
Published: 2026-03-06
Fetched: 2026-03-09 06:05
