March 06, 2026

Which Data Matter? Embedding-Based Data Selection for Speech Recognition

Authors

Zakaria Aldeneh, Skyler Seto, Maureen de Seyssel, Jie Chi, Zijin Gu, Takuya Higuchi, Jee-weon Jung, Shinji Watanabe, David Grangier, Barry-John Theobald, Tatiana Likhomanenko

Abstract

Modern ASR systems are typically trained on large-scale pseudo-labeled, in-the-wild data spanning multiple domains. While such heterogeneous data benefit generalist models designed for broad deployment, they pose challenges for specialist models targeting specific domains: specialist models lack the capacity to learn from all available data, and one must pay closer attention to addressing the mismatch between training and test conditions. In this work, we study targeted data selection as a strategy to address these challenges, selecting relevant subsets from 100k hours of in-the-wild training data to optimize performance on target domains. We represent speech samples using embeddings that capture complementary characteristics (speaker attributes, phonetic content, and semantic meaning) and analyze how relevance and diversity along these axes during data selection affect downstream ASR performance. Our experiments with CTC-based Conformer models show that training on a strategically selected 5% subset can outperform models trained on the full dataset, with relative WER reductions of up to 36.8% on target domains.
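
To make the idea of relevance-based selection concrete, here is a minimal sketch in Python. It is not the authors' method: the abstract does not specify the selection criterion, so ranking by cosine similarity to the target-domain centroid is an assumption, as are the function names `select_by_relevance` and `relative_wer_reduction` and the use of precomputed embeddings stored as NumPy arrays. The sketch covers relevance only and ignores diversity. The second helper shows the standard arithmetic behind a relative WER reduction.

```python
import numpy as np


def select_by_relevance(pool_emb, target_emb, fraction=0.05):
    """Return indices of the pool samples closest to the target centroid.

    Assumption: relevance is cosine similarity to the mean target embedding.
    pool_emb:   (n_pool, dim) embeddings of candidate training samples
    target_emb: (n_target, dim) embeddings of target-domain samples
    """
    # L2-normalize pool rows so dot products equal cosine similarities.
    pool = pool_emb / np.linalg.norm(pool_emb, axis=1, keepdims=True)
    centroid = target_emb.mean(axis=0)
    centroid = centroid / np.linalg.norm(centroid)

    scores = pool @ centroid
    k = max(1, int(round(fraction * len(pool))))
    # Sort by descending similarity and keep the top k indices.
    return np.argsort(scores)[::-1][:k]


def relative_wer_reduction(wer_baseline, wer_new):
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    return 100.0 * (wer_baseline - wer_new) / wer_baseline
```

With a pool of 200 embeddings and `fraction=0.05`, `select_by_relevance` returns the 10 most target-like indices. Applying the same function separately to speaker, phonetic, and semantic embeddings would give one candidate subset per axis.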

Metadata

arXiv ID: 2603.05819

Provider: ARXIV

Primary Category: cs.SD

Published: 2026-03-06

Fetched: 2026-03-09 06:05
