
DualTurn: Learning Turn-Taking from Dual-Channel Generative Speech Pretraining

Authors

Shangeth Rajaa

Abstract

Speech-to-speech models handle turn-taking naturally but offer limited support for tool-calling or complex reasoning, while production ASR-LLM-TTS voice pipelines offer these capabilities but rely on silence timeouts, which lead to unnatural turn-taking. We present DualTurn, which narrows this gap through generative pretraining on dual-channel conversational audio. The model generates both speakers' future audio autoregressively, implicitly learning conversational dynamics without any labels, and is then fine-tuned to predict interpretable turn-taking signals that map directly to agent actions. DualTurn monitors both channels continuously, anticipating turn boundaries and producing one of five agent actions. On standard benchmarks, DualTurn (0.5B) outperforms both VAP on agent action prediction (wF1 0.633 vs. 0.389) and a 3.1B audio-text model on word-level turn prediction (AUC 0.930 vs. 0.880), while anticipating turn boundaries earlier and with fewer interruptions.
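The abstract describes fine-tuned turn-taking signals that "map directly to agent actions." A minimal sketch of how such a mapping could work is below; the action names, probability inputs, and threshold logic are all assumptions for illustration (the abstract states there are five actions but does not name them), not the paper's actual decision rule.

```python
from enum import Enum

class AgentAction(Enum):
    # Hypothetical action set: the paper specifies five agent actions
    # but does not name them in the abstract, so these are illustrative.
    WAIT = "wait"                # user still holds the turn; keep listening
    RESPOND = "respond"          # turn boundary anticipated; agent may speak
    BACKCHANNEL = "backchannel"  # brief acknowledgement without taking the turn
    STOP = "stop"                # user is interrupting; agent should yield
    CONTINUE = "continue"        # agent keeps speaking

def decide_action(p_turn_end: float, p_backchannel: float,
                  p_user_interrupt: float, agent_speaking: bool,
                  threshold: float = 0.5) -> AgentAction:
    """Map per-frame turn-taking probabilities to one agent action.

    All probability names and the single shared threshold are assumed
    for this sketch, not taken from the paper.
    """
    if agent_speaking:
        # While the agent speaks, the main risk is talking over the user.
        if p_user_interrupt >= threshold:
            return AgentAction.STOP
        return AgentAction.CONTINUE
    if p_turn_end >= threshold:
        return AgentAction.RESPOND       # boundary anticipated: take the turn
    if p_backchannel >= threshold:
        return AgentAction.BACKCHANNEL   # acknowledge without taking the turn
    return AgentAction.WAIT
```

Because the signals are evaluated on every frame of both channels, a rule like this lets the agent act before a silence timeout would fire, which is the behavior the abstract contrasts with ASR-LLM-TTS pipelines.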

Metadata

arXiv ID: 2603.08216
Provider: ARXIV
Primary Category: eess.AS
Categories: eess.AS, cs.CL, cs.SD
Comment: Submitted to Interspeech 2026
Published: 2026-03-09
Fetched: 2026-03-10 05:43
