February 26, 2026

SemanticVocoder: Bridging Audio Generation and Audio Understanding via Semantic Latents

Authors

Zeyu Xie, Chenxing Li, Qiao Jin, Xuenan Xu, Guanrou Yang, Wenfu Wang, Mengyue Wu, Dong Yu, Yuexian Zou

Abstract

Recent audio generation models typically rely on Variational Autoencoders (VAEs) and perform generation within the VAE latent space. Although VAEs excel at compression and reconstruction, their latents inherently encode low-level acoustic details rather than semantically discriminative information, leading to entangled event semantics and complicating the training of generative models. To address these issues, we discard VAE acoustic latents and introduce semantic encoder latents, thereby proposing SemanticVocoder, a generative vocoder that directly synthesizes waveforms from semantic latents. Equipped with SemanticVocoder, our text-to-audio generation model achieves a Fréchet Distance of 12.823 and a Fréchet Audio Distance of 1.709 on the AudioCaps test set, as the introduced semantic latents exhibit superior discriminability compared to acoustic VAE latents. Beyond improved generation performance, it also serves as a promising attempt towards unifying audio understanding and generation within a shared semantic space. Generated samples are available at https://zeyuxie29.github.io/SemanticVocoder/.
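
For readers unfamiliar with the two metrics quoted in the abstract, the following is a minimal sketch of the standard Fréchet distance between two Gaussians fitted to embedding sets, which is the quantity that Fréchet Distance and Fréchet Audio Distance scores are conventionally built on. The abstract does not say which embedding models or evaluation toolkit the authors used, so the function name, the random stand-in embeddings, and the dimensions below are illustrative assumptions, not the paper's evaluation code.

```python
import numpy as np


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance (squared form, as conventionally reported) between
    Gaussians fitted to two embedding sets of shape (num_clips, embedding_dim)."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)

    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite inputs (up to numerical noise, hence the clip).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


# Hypothetical stand-ins for embeddings of reference and generated audio clips.
rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 8))
generated = rng.normal(loc=0.5, size=(500, 8))

same = frechet_distance(reference, reference)     # identical sets: close to zero
shifted = frechet_distance(reference, generated)  # shifted mean: clearly larger
```

Lower values mean the generated-audio embedding distribution sits closer to the reference distribution, which is why the abstract reports both scores as evidence of generation quality.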

Metadata

arXiv ID: 2602.23333

Provider: ARXIV

Primary Category: cs.SD

Published: 2026-02-26

Fetched: 2026-02-27 04:35
