Paper
Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models
Authors
Christian Simon, Masato Ishii, Wei-Yao Wang, Koichi Saito, Akio Hayakawa, Dongseok Shim, Zhi Zhong, Shuyang Cui, Shusuke Takahashi, Takashi Shibuya, Yuki Mitsufuji
Abstract
Scaling multimodal alignment between video and audio is challenging, particularly because of limited data and the mismatch between text descriptions and frame-level video information. In this work, we tackle the scaling challenge in multimodal-to-audio generation, examining whether models trained on short instances can generalize to longer ones at test time. To this end, we present a multimodal hierarchical network, MMHNet, an enhanced extension of state-of-the-art video-to-audio models. Our approach integrates a hierarchical method and non-causal Mamba to support long-form audio generation. The proposed method significantly improves long audio generation, up to more than 5 minutes. We also show that training on short durations and testing on long ones is possible in video-to-audio generation, without training on the longer durations. In our experiments, the proposed method achieves remarkable results on long-video-to-audio benchmarks, outperforming prior work on video-to-audio tasks. Moreover, we demonstrate that our model can generate more than 5 minutes of audio, whereas prior video-to-audio methods fall short at long durations.
Metadata
arXiv ID: 2602.20981v1
Published: 2026-02-24
Categories: cs.CV (primary), cs.AI
Comment: Accepted to CVPR 2026
Abstract page: https://arxiv.org/abs/2602.20981v1
PDF: https://arxiv.org/pdf/2602.20981v1