
Learning Compact Video Representations for Efficient Long-form Video Understanding in Large Multimodal Models

Authors

Yuxiao Chen, Jue Wang, Zhikang Zhang, Jingru Yi, Xu Zhang, Yang Zou, Zhaowei Cai, Jianbo Yuan, Xinyu Li, Hao Yang, Davide Modolo

Abstract

With recent advancements in video backbone architectures, combined with the remarkable achievements of large language models (LLMs), the analysis of long-form videos spanning tens of minutes has become both feasible and increasingly prevalent. However, the inherently redundant nature of video sequences poses significant challenges for contemporary state-of-the-art models. These challenges stem from two primary aspects: 1) efficiently incorporating a larger number of frames within memory constraints, and 2) extracting discriminative information from the vast volume of input data. In this paper, we introduce a novel end-to-end schema for long-form video understanding, which includes an information-density-based adaptive video sampler (AVS) and an autoencoder-based spatiotemporal video compressor (SVC) integrated with a multimodal large language model (MLLM). Our proposed system offers two major advantages: it adaptively and effectively captures essential information from video sequences of varying durations, and it achieves high compression rates while preserving crucial discriminative information. The proposed framework demonstrates promising performance across various benchmarks, excelling in both long-form video understanding tasks and standard video understanding benchmarks. These results underscore the versatility and efficacy of our approach, particularly in managing the complexities of prolonged video sequences.
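The two components named in the abstract can be illustrated with a toy sketch. The paper's actual AVS and SVC are learned modules; here, purely as hypothetical stand-ins, information density is approximated by inter-frame difference, adaptive sampling by inverse-CDF sampling over the density distribution (so high-motion spans receive more of the frame budget), and the autoencoder compressor by spatial average pooling. None of these function names or choices come from the paper itself.

```python
import numpy as np

def information_density(frames):
    # Hypothetical proxy for information density: mean absolute
    # difference between consecutive frames (the paper's AVS likely
    # uses a learned measure, not this heuristic).
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=(1, 2, 3))
    # Give the first frame the average score so every frame is scored.
    first = diffs.mean() if len(diffs) else 1.0
    return np.concatenate([[first], diffs])

def adaptive_sample(frames, budget):
    # Spend the fixed frame budget where density is high: stratified
    # inverse-CDF sampling over the normalized density distribution.
    density = information_density(frames) + 1e-8
    cdf = np.cumsum(density / density.sum())
    targets = (np.arange(budget) + 0.5) / budget
    idx = np.clip(np.searchsorted(cdf, targets), 0, len(frames) - 1)
    return frames[idx]

def compress(frames, factor=2):
    # Stand-in for the SVC: spatial average pooling by `factor`
    # (the real compressor is a learned spatiotemporal autoencoder).
    t, h, w, c = frames.shape
    f = frames[:, : h - h % factor, : w - w % factor, :]
    return f.reshape(t, h // factor, factor, w // factor, factor, c).mean(axis=(2, 4))

# Toy clip: 64 frames of 8x8 RGB, static first half, motion in the second.
rng = np.random.default_rng(0)
clip = np.zeros((64, 8, 8, 3), dtype=np.float32)
clip[32:] = rng.random((32, 8, 8, 3))

sampled = adaptive_sample(clip, budget=8)   # 8 frames, biased to the moving half
tokens = compress(sampled, factor=2)        # compact representation for the MLLM
print(tokens.shape)
```

The point of the sketch is the interface, not the internals: a duration-agnostic sampler reduces any clip to a fixed frame budget, and the compressor shrinks those frames further before they are handed to the language model.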

Metadata

arXiv ID: 2602.17869
Provider: ARXIV
Primary Category: cs.CV
Published: 2026-02-19
Fetched: 2026-02-23 05:33
