Research

Paper

TESTING March 11, 2026

V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation

Authors

Yan-Bo Lin, Jonah Casebeer, Long Mai, Aniruddha Mahapatra, Gedas Bertasius, Nicholas J. Bryan

Abstract

Generating music that temporally aligns with video events is challenging for existing text-to-music models, which lack fine-grained temporal control. We introduce V2M-Zero, a zero-pair video-to-music generation approach that outputs time-aligned music for video. Our method is motivated by a key observation: temporal synchronization requires matching when and how much change occurs, not what changes. While musical and visual events differ semantically, they exhibit shared temporal structure that can be captured independently within each modality. We capture this structure through event curves computed from intra-modal similarity using pretrained music and video encoders. By measuring temporal change within each modality independently, these curves provide comparable representations across modalities. This enables a simple training strategy: fine-tune a text-to-music model on music-event curves, then substitute video-event curves at inference without cross-modal training or paired data. Across OES-Pub, MovieGenBench-Music, and AIST++, V2M-Zero achieves substantial gains over paired-data baselines: 5-21% higher audio quality, 13-15% better semantic alignment, 21-52% improved temporal synchronization, and 28% higher beat alignment on dance videos. We find similar results via a large crowd-source subjective listening test. Overall, our results validate that temporal alignment through within-modality features, rather than paired cross-modal supervision, is effective for video-to-music generation. Results are available at https://genjib.github.io/v2m_zero/

Metadata

arXiv ID: 2603.11042

Provider: ARXIV

Primary Category: cs.CV

Published: 2026-03-11

Fetched: 2026-03-12 04:21

Related papers

Fractal universe and quantum gravity made simple

Fabio Briscese, Gianluca Calcagni • 2026-03-25

POLY-SIM: Polyglot Speaker Identification with Missing Modality Grand Challenge 2026 Evaluation Plan

Marta Moscati, Muhammad Saad Saeed, Marina Zanoni, Mubashir Noman, Rohan Kuma... • 2026-03-25

LensWalk: Agentic Video Understanding by Planning How You See in Videos

Keliang Li, Yansong Li, Hongze Shen, Mengdi Liu, Hong Chang, Shiguang Shan • 2026-03-25

Orientation Reconstruction of Proteins using Coulomb Explosions

Tomas André, Alfredo Bellisario, Nicusor Timneanu, Carl Caleman • 2026-03-25

The role of spatial context and multitask learning in the detection of organic and conventional farming systems based on Sentinel-2 time series

Jan Hemmerling, Marcel Schwieder, Philippe Rufin, Leon-Friedrich Thomas, Mire... • 2026-03-25

Raw Data (Debug)

{
  "raw_xml": "<entry>\n    <id>http://arxiv.org/abs/2603.11042v1</id>\n    <title>V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation</title>\n    <updated>2026-03-11T17:59:40Z</updated>\n    <link href='https://arxiv.org/abs/2603.11042v1' rel='alternate' type='text/html'/>\n    <link href='https://arxiv.org/pdf/2603.11042v1' rel='related' title='pdf' type='application/pdf'/>\n    <summary>Generating music that temporally aligns with video events is challenging for existing text-to-music models, which lack fine-grained temporal control. We introduce V2M-Zero, a zero-pair video-to-music generation approach that outputs time-aligned music for video. Our method is motivated by a key observation: temporal synchronization requires matching when and how much change occurs, not what changes. While musical and visual events differ semantically, they exhibit shared temporal structure that can be captured independently within each modality. We capture this structure through event curves computed from intra-modal similarity using pretrained music and video encoders. By measuring temporal change within each modality independently, these curves provide comparable representations across modalities. This enables a simple training strategy: fine-tune a text-to-music model on music-event curves, then substitute video-event curves at inference without cross-modal training or paired data. Across OES-Pub, MovieGenBench-Music, and AIST++, V2M-Zero achieves substantial gains over paired-data baselines: 5-21% higher audio quality, 13-15% better semantic alignment, 21-52% improved temporal synchronization, and 28% higher beat alignment on dance videos. We find similar results via a large crowd-source subjective listening test. Overall, our results validate that temporal alignment through within-modality features, rather than paired cross-modal supervision, is effective for video-to-music generation. Results are available at https://genjib.github.io/v2m_zero/</summary>\n    <category scheme='http://arxiv.org/schemas/atom' term='cs.CV'/>\n    <category scheme='http://arxiv.org/schemas/atom' term='cs.AI'/>\n    <category scheme='http://arxiv.org/schemas/atom' term='cs.LG'/>\n    <category scheme='http://arxiv.org/schemas/atom' term='cs.MM'/>\n    <category scheme='http://arxiv.org/schemas/atom' term='cs.SD'/>\n    <published>2026-03-11T17:59:40Z</published>\n    <arxiv:comment>Project page: https://genjib.github.io/v2m_zero/</arxiv:comment>\n    <arxiv:primary_category term='cs.CV'/>\n    <author>\n      <name>Yan-Bo Lin</name>\n    </author>\n    <author>\n      <name>Jonah Casebeer</name>\n    </author>\n    <author>\n      <name>Long Mai</name>\n    </author>\n    <author>\n      <name>Aniruddha Mahapatra</name>\n    </author>\n    <author>\n      <name>Gedas Bertasius</name>\n    </author>\n    <author>\n      <name>Nicholas J. Bryan</name>\n    </author>\n  </entry>"
}