Paper
TW-Sound580K: A Regional Audio-Text Dataset with Verification-Guided Curation for Localized Audio-Language Modeling
Authors
Hao-Hui Xie, Ho-Lam Chung, Yi-Cheng Lin, Ke-Han Lu, Wenze Ren, Xie Chen, Hung-yi Lee
Abstract
Large Audio-Language Models (LALMs) typically struggle with localized dialectal prosody due to the scarcity of specialized corpora. We present TW-Sound580K, a Taiwanese audio-text instruction dataset developed through a Verify-Generate-Critique (VGC) protocol. This pipeline leverages Dual-ASR validation to filter 522K raw clips, subsequently expanding them into 580,000 high-fidelity instruction pairs using a teacher model. The dataset's utility is demonstrated through Tai-LALM, which fine-tunes a DeSTA 2.5-Audio-initialized backbone and incorporates a dynamic Dual-ASR Arbitration strategy to optimize transcription selection during inference. On the TAU Benchmark, Tai-LALM reaches 49.1% accuracy, marking a 6.5% absolute improvement over the zero-shot baseline (42.6% with ASR text conditioning). This confirms that integrating regional corpora with rigorous curation and dynamic arbitration significantly enhances LALM performance on localized speech.
Metadata
Related papers
Cosmic Shear in Effective Field Theory at Two-Loop Order: Revisiting $S_8$ in Dark Energy Survey Data
Shi-Fan Chen, Joseph DeRose, Mikhail M. Ivanov, Oliver H. E. Philcox • 2026-03-30
Stop Probing, Start Coding: Why Linear Probes and Sparse Autoencoders Fail at Compositional Generalisation
Vitória Barin Pacela, Shruti Joshi, Isabela Camacho, Simon Lacoste-Julien, Da... • 2026-03-30
SNID-SAGE: A Modern Framework for Interactive Supernova Classification and Spectral Analysis
Fiorenzo Stoppa, Stephen J. Smartt • 2026-03-30
Acoustic-to-articulatory Inversion of the Complete Vocal Tract from RT-MRI with Various Audio Embeddings and Dataset Sizes
Sofiane Azzouz, Pierre-André Vuissoz, Yves Laprie • 2026-03-30
Rotating black hole shadows in metric-affine bumblebee gravity
Jose R. Nascimento, Ana R. M. Oliveira, Albert Yu. Petrov, Paulo J. Porfírio,... • 2026-03-30
Raw Data (Debug)
{
"raw_xml": "<entry>\n <id>http://arxiv.org/abs/2603.05094v1</id>\n <title>TW-Sound580K: A Regional Audio-Text Dataset with Verification-Guided Curation for Localized Audio-Language Modeling</title>\n <updated>2026-03-05T12:07:19Z</updated>\n <link href='https://arxiv.org/abs/2603.05094v1' rel='alternate' type='text/html'/>\n <link href='https://arxiv.org/pdf/2603.05094v1' rel='related' title='pdf' type='application/pdf'/>\n <summary>Large Audio-Language Models (LALMs) typically struggle with localized dialectal prosody due to the scarcity of specialized corpora. We present TW-Sound580K, a Taiwanese audio-text instruction dataset developed through a Verify-Generate-Critique (VGC) protocol. This pipeline leverages Dual-ASR validation to filter 522K raw clips, subsequently expanding them into 580,000 high-fidelity instruction pairs using a teacher model. The dataset's utility is demonstrated through Tai-LALM, which fine-tunes a DeSTA 2.5-Audio-initialized backbone and incorporates a dynamic Dual-ASR Arbitration strategy to optimize transcription selection during inference. On the TAU Benchmark, Tai-LALM reaches 49.1% accuracy, marking a 6.5% absolute improvement over the zero-shot baseline (42.6% with ASR text conditioning). This confirms that integrating regional corpora with rigorous curation and dynamic arbitration significantly enhances LALM performance on localized speech.</summary>\n <category scheme='http://arxiv.org/schemas/atom' term='cs.SD'/>\n <published>2026-03-05T12:07:19Z</published>\n <arxiv:primary_category term='cs.SD'/>\n <author>\n <name>Hao-Hui Xie</name>\n </author>\n <author>\n <name>Ho-Lam Chung</name>\n </author>\n <author>\n <name>Yi-Cheng Lin</name>\n </author>\n <author>\n <name>Ke-Han Lu</name>\n </author>\n <author>\n <name>Wenze Ren</name>\n </author>\n <author>\n <name>Xie Chen</name>\n </author>\n <author>\n <name>Hung-yi Lee</name>\n </author>\n </entry>"
}