AI · LLM · March 12, 2026

INFACT: A Diagnostic Benchmark for Induced Faithfulness and Factuality Hallucinations in Video-LLMs

Authors

Junqi Yang, Yuecong Min, Jie Zhang, Shiguang Shan, Xilin Chen

Abstract

Despite rapid progress, Video Large Language Models (Video-LLMs) remain unreliable due to hallucinations, which are outputs that contradict either video evidence (faithfulness) or verifiable world knowledge (factuality). Existing benchmarks provide limited coverage of factuality hallucinations and predominantly evaluate models only in clean settings. We introduce INFACT, a diagnostic benchmark comprising 9,800 QA instances with fine-grained taxonomies for faithfulness and factuality, spanning real and synthetic videos. INFACT evaluates models in four modes: Base (clean), Visual Degradation, Evidence Corruption, and Temporal Intervention for order-sensitive items. Reliability under induced modes is quantified using Resist Rate (RR) and Temporal Sensitivity Score (TSS). Experiments on 14 representative Video-LLMs reveal that higher Base-mode accuracy does not reliably translate to higher reliability in the induced modes, with evidence corruption reducing stability and temporal intervention yielding the largest degradation. Notably, many open-source baselines exhibit near-zero TSS on factuality, indicating pronounced temporal inertia on order-sensitive questions.
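
The abstract does not give formal definitions for RR or TSS. As a hedged illustration only, the sketch below assumes RR is the fraction of items answered correctly in Base mode that remain correct under an induced mode, and TSS is the fraction of order-sensitive items whose prediction changes after the event order is altered; all names and fields are hypothetical, not the authors' implementation.

```python
# Hypothetical sketch of resist-rate and temporal-sensitivity style metrics.
# The exact definitions in the INFACT paper may differ; this only mirrors the
# abstract's description of "stability under induced modes".

from dataclasses import dataclass


@dataclass
class Item:
    correct_base: bool            # answered correctly in clean Base mode
    correct_induced: bool         # answered correctly under an induced mode
    order_sensitive: bool         # question depends on event order
    changed_after_reorder: bool   # prediction changed after Temporal Intervention


def resist_rate(items: list[Item]) -> float:
    """Assumed RR: share of Base-correct items that stay correct under the induced mode."""
    base_correct = [it for it in items if it.correct_base]
    if not base_correct:
        return 0.0
    return sum(it.correct_induced for it in base_correct) / len(base_correct)


def temporal_sensitivity(items: list[Item]) -> float:
    """Assumed TSS: share of order-sensitive items whose answer changes when order is intervened on."""
    sensitive = [it for it in items if it.order_sensitive]
    if not sensitive:
        return 0.0
    return sum(it.changed_after_reorder for it in sensitive) / len(sensitive)
```

Under this reading, a near-zero TSS means the model almost never revises order-sensitive answers when the event order is changed, which is consistent with the "temporal inertia" the abstract reports for many open-source baselines.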

Metadata

arXiv ID: 2603.11481
Provider: ARXIV
Primary Category: cs.CV
Published: 2026-03-12
Fetched: 2026-03-14 05:03
