Paper
A Skill-augmented Agentic Framework and Benchmark for Multi-Video Understanding
Authors
Yue Zhang, Liqiang Jing, Jia Li, Yapeng Tian, Xinya Du, Yunhui Guo, Vibhav Gogate
Abstract
Multimodal Large Language Models have achieved strong performance in single-video understanding, yet their ability to reason across multiple videos remains limited. Existing approaches typically concatenate multiple videos into a single input and perform direct inference, which introduces training-inference mismatch, information loss from frame compression, and a lack of explicit cross-video coordination. Meanwhile, current multi-video benchmarks primarily emphasize event-level comparison, leaving identity-level matching, fine-grained discrimination, and structured multi-step reasoning underexplored. To address these gaps, we introduce MVX-Bench, a Multi-Video Cross-Dimension Benchmark that reformulates 11 classical computer vision tasks into a unified multi-video question-answering framework, comprising 1,442 questions over 4,255 videos from diverse real-world datasets. We further propose SAMA, a Skill-Augmented Agentic Framework for Multi-Video Understanding, which integrates visual tools, task-specific skills, and a conflict-aware verification mechanism to enable iterative and structured reasoning. Experimental results show that SAMA outperforms strong open-source baselines and GPT on MVX-Bench, and ablations validate the effectiveness of skill design and conflict resolution.
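The abstract only names SAMA's components (task-specific skills, visual tools, conflict-aware verification); the paper's actual implementation is not shown on this page. The sketch below is therefore a minimal, hypothetical rendering of the control flow such a framework implies, not SAMA's API: every name here (Finding, resolve_conflicts, the stub skills, the 0.6 confidence threshold) is an illustrative assumption.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Finding:
    skill: str        # which skill produced this candidate answer
    answer: str       # candidate answer to the multi-video question
    confidence: float # skill's self-reported confidence in [0, 1]

# A "skill" here is just a callable that inspects the videos and the
# question and returns a Finding. Real skills would wrap visual tools
# (e.g., a tracker or a re-identification model); these stubs only
# illustrate the control flow.
Skill = Callable[[list[str], str], Finding]

def identity_matching_skill(videos: list[str], question: str) -> Finding:
    # Stub: pretend a re-ID tool matched the same person across videos.
    return Finding("identity_matching", "yes", 0.8)

def event_comparison_skill(videos: list[str], question: str) -> Finding:
    # Stub: pretend event-level comparison reached the opposite answer.
    return Finding("event_comparison", "no", 0.4)

def resolve_conflicts(findings: list[Finding]) -> str:
    """Conflict-aware verification (assumed form): if skills disagree,
    keep only findings above a confidence threshold, and fall back to
    the single most confident finding if disagreement persists."""
    if len({f.answer for f in findings}) == 1:  # no conflict
        return findings[0].answer
    trusted = [f for f in findings if f.confidence >= 0.6]
    if trusted and len({f.answer for f in trusted}) == 1:
        return trusted[0].answer
    return max(findings, key=lambda f: f.confidence).answer

def answer(videos: list[str], question: str, skills: list[Skill]) -> str:
    # One pass: run each task-specific skill, then verify. The paper's
    # framework is iterative; further rounds (e.g., re-sampling frames
    # after an inconclusive verification) are elided here.
    findings = [skill(videos, question) for skill in skills]
    return resolve_conflicts(findings)

print(answer(["v1.mp4", "v2.mp4"],
             "Is the same person present in both videos?",
             [identity_matching_skill, event_comparison_skill]))

In this toy run the two stub skills disagree, and the verification step sides with the higher-confidence identity-matching finding; the point is only the structure: per-skill candidate answers plus an explicit arbitration step, rather than a single concatenated-input inference.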
Metadata
arXiv ID: 2603.14733v1
Primary category: cs.CV
Published: 2026-03-16
Abstract page: https://arxiv.org/abs/2603.14733v1
PDF: https://arxiv.org/pdf/2603.14733v1