A Semi-spontaneous Dutch Speech Dataset for Speech Enhancement and Speech Recognition

Authors

Dimme de Groot, Yuanyuan Zhang, Jorge Martinez, Odette Scharenborg

Abstract

We present DRES, a 1.5-hour dataset of Dutch realistic elicited (semi-spontaneous) speech from 80 speakers, recorded in noisy public indoor environments. DRES was designed as a test set for evaluating state-of-the-art (SOTA) automatic speech recognition (ASR) and speech enhancement (SE) models in a real-world scenario: a person speaking in a public indoor space with background talkers and noise. The speech was recorded with a four-channel linear microphone array. In this work, we evaluate the speech quality achieved by five well-known single-channel SE algorithms and the recognition performance of eight SOTA off-the-shelf ASR models before and after applying SE to the DRES speech. Five of the eight ASR models achieve word error rates (WERs) below 22% on DRES, despite the challenging conditions. In contrast to recent work, we find no positive effect of modern single-channel SE on ASR performance, which underlines the importance of evaluating in realistic conditions.
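
The abstract reports ASR performance as WER. As a reminder of what that metric measures, here is a minimal sketch of a word-level WER computation: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. This is not the authors' scoring pipeline. The function name `word_error_rate`, the lowercase-and-split normalization, and the Dutch example sentences are assumptions made for illustration only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: text is normalized only by lowercasing and whitespace
    splitting; a real evaluation would likely normalize more carefully.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word in a six-word reference.
example_wer = word_error_rate(
    "ik ga morgen naar de markt",
    "ik ga morgen na de markt",
)
```

In this hypothetical example, one of six reference words is substituted, so `example_wer` is 1/6 (about 16.7%). Comparing the same quantity on unprocessed and SE-processed audio is one way to express the before-and-after comparison the abstract describes.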

Metadata

arXiv ID: 2603.09725
Provider: ARXIV
Primary Category: eess.AS
Published: 2026-03-10
Fetched: 2026-03-11 06:02
Comment: Submitted to Interspeech 2026
URL: https://arxiv.org/abs/2603.09725v1
PDF: https://arxiv.org/pdf/2603.09725v1