March 16, 2026

A Skill-augmented Agentic Framework and Benchmark for Multi-Video Understanding

Authors

Yue Zhang, Liqiang Jing, Jia Li, Yapeng Tian, Xinya Du, Yunhui Guo, Vibhav Gogate

Abstract

Multimodal Large Language Models have achieved strong performance in single-video understanding, yet their ability to reason across multiple videos remains limited. Existing approaches typically concatenate multiple videos into a single input and perform direct inference, which introduces training-inference mismatch, information loss from frame compression, and a lack of explicit cross-video coordination. Meanwhile, current multi-video benchmarks primarily emphasize event-level comparison, leaving identity-level matching, fine-grained discrimination, and structured multi-step reasoning underexplored. To address these gaps, we introduce MVX-Bench, a Multi-Video Cross-Dimension Benchmark that reformulates 11 classical computer vision tasks into a unified multi-video question-answering framework, comprising 1,442 questions over 4,255 videos from diverse real-world datasets. We further propose SAMA, a Skill-Augmented Agentic Framework for Multi-Video Understanding, which integrates visual tools, task-specific skills, and a conflict-aware verification mechanism to enable iterative and structured reasoning. Experimental results show that SAMA outperforms strong open-source baselines and GPT on MVX-Bench, and ablations validate the effectiveness of skill design and conflict resolution.
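The abstract does not describe SAMA's implementation, but the high-level loop it outlines (per-video evidence gathering via task-specific skills backed by visual tools, conflict-aware verification across videos, then structured answer synthesis) can be illustrated with a minimal, hypothetical sketch. Every name below (Evidence, run_skill, resolve_conflicts, synthesize_answer) is an assumption for illustration only, not the paper's actual API.

```python
# Hypothetical sketch of a skill-augmented agentic loop for multi-video QA.
# All class and function names are illustrative assumptions; the abstract
# does not specify SAMA's actual interfaces or implementation.
from dataclasses import dataclass, field


@dataclass
class Evidence:
    video_id: str
    skill: str            # e.g. "identity_matching", "event_comparison"
    claim: str
    confidence: float


@dataclass
class AgentState:
    question: str
    videos: list[str]
    evidence: list[Evidence] = field(default_factory=list)


def run_skill(skill: str, video_id: str, question: str) -> Evidence:
    """Stub for a task-specific skill backed by visual tools (detector,
    tracker, captioner, ...); returns structured per-video evidence."""
    return Evidence(video_id, skill,
                    claim=f"{skill} result for {video_id}", confidence=0.5)


def resolve_conflicts(evidence: list[Evidence]) -> list[Evidence]:
    """Stub conflict-aware verification: for each skill, keep only the
    highest-confidence claim. A real implementation would compare claims
    across videos and re-examine disagreements before answering."""
    best: dict[str, Evidence] = {}
    for ev in evidence:
        if ev.skill not in best or ev.confidence > best[ev.skill].confidence:
            best[ev.skill] = ev
    return list(best.values())


def synthesize_answer(state: AgentState) -> str:
    """Stub for the final reasoning step (an LLM call in practice)."""
    return f"Answer derived from {len(state.evidence)} pieces of evidence."


def answer_question(question: str, videos: list[str], skills: list[str]) -> str:
    state = AgentState(question, videos)
    # 1. Gather per-video evidence with each relevant skill.
    for video_id in videos:
        for skill in skills:
            state.evidence.append(run_skill(skill, video_id, question))
    # 2. Verify and reconcile evidence across videos.
    state.evidence = resolve_conflicts(state.evidence)
    # 3. Synthesize a structured, multi-step answer.
    return synthesize_answer(state)


print(answer_question("Which clip shows the same person as clip A?",
                      ["video_a", "video_b"], ["identity_matching"]))
```

This sketch runs one pass for brevity; an iterative agent would repeat the gather-verify-synthesize cycle until the evidence is consistent or a round limit is reached.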

Metadata

arXiv ID: 2603.14733
Provider: ARXIV
Primary Category: cs.CV
Published: 2026-03-16
Fetched: 2026-03-17 06:02
