AI LLM March 05, 2026

Building Enterprise Realtime Voice Agents from Scratch: A Technical Tutorial

Authors

Jielin Qiu, Zixiang Chen, Liangwei Yang, Ming Zhu, Zhiwei Liu, Juntao Tan, Wenting Zhao, Rithesh Murthy, Roshan Ram, Akshara Prabhakar, Shelby Heinecke, Caiming Xiong, Silvio Savarese, Huan Wang

Abstract

We present a technical tutorial for building enterprise-grade realtime voice agents from first principles. While over 25 open-source speech-to-speech models and numerous voice agent frameworks exist, no single resource explains the complete pipeline from individual components to a working streaming voice agent with function calling capabilities. Through systematic investigation, we find that (1) native speech-to-speech models like Qwen2.5-Omni, while capable of high-quality audio generation, are too slow for realtime interaction (~13 s time-to-first-audio); (2) the industry-standard approach uses a cascaded streaming pipeline: STT → LLM → TTS, where each component streams its output to the next; and (3) the key to "realtime" is not any single fast model but rather streaming and pipelining across components. We build a complete voice agent using Deepgram (streaming STT), vLLM-served LLMs with function calling (streaming text generation), and ElevenLabs (streaming TTS), achieving a measured P50 time-to-first-audio of 947 ms (best case 729 ms) with cloud LLM APIs, and comparable latency with self-hosted vLLM on an NVIDIA A10G GPU. We release the full codebase as a tutorial with working, tested code for every component.
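
The cascaded STT → LLM → TTS handoff described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' released code and it makes no calls to Deepgram, vLLM, or ElevenLabs: every function name here (`stt_stream`, `llm_stream`, `sentence_chunks`, `tts_stream`, `run_pipeline`) is a hypothetical stand-in, the transcript and reply are hard-coded, and splitting the LLM output at sentence boundaries before synthesis is an assumption made for illustration. The only point it demonstrates is that, when each stage streams to the next, the first audio chunk can be produced before the LLM has finished generating its reply.

```python
import asyncio
from typing import AsyncIterator, List

log: List[str] = []  # ordered trace of what each stage did


async def stt_stream(frames: List[bytes]) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming STT client."""
    for _ in frames:
        await asyncio.sleep(0)  # yield control, as a network read would
    log.append("stt:final")
    yield "what is my order status"  # hard-coded transcript (assumption)


async def llm_stream(transcripts: AsyncIterator[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for a token-streaming LLM endpoint."""
    async for _ in transcripts:
        reply = "Let me check. Your order shipped today."  # canned reply
        for token in reply.split(" "):
            await asyncio.sleep(0)
            log.append(f"llm:{token}")
            yield token
    log.append("llm:done")


async def sentence_chunks(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Group tokens into sentences so TTS can start on the first one."""
    buf: List[str] = []
    async for tok in tokens:
        buf.append(tok)
        if tok.endswith((".", "?", "!")):
            yield " ".join(buf)
            buf = []
    if buf:  # flush any trailing text without end punctuation
        yield " ".join(buf)


async def tts_stream(sentences: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Hypothetical stand-in for streaming TTS; bytes are placeholders."""
    async for sentence in sentences:
        await asyncio.sleep(0)
        log.append(f"tts:audio[{sentence}]")
        yield sentence.encode("utf-8")


async def run_pipeline(frames: List[bytes]) -> List[bytes]:
    """Chain the stages; each one consumes the previous stage's stream."""
    log.clear()
    audio: List[bytes] = []
    stages = tts_stream(sentence_chunks(llm_stream(stt_stream(frames))))
    async for chunk in stages:
        audio.append(chunk)
    return audio


if __name__ == "__main__":
    asyncio.run(run_pipeline([b"\x00"] * 3))
    for event in log:
        print(event)
```

In the printed trace, the audio event for the first sentence appears before the LLM tokens of the second sentence. Chained async generators are pull-based, so this sketch shows only the ordering of the handoff; a production pipeline would more likely run each stage as its own task connected by queues so that the stages overlap in time.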

Metadata

arXiv ID: 2603.05413

Provider: ARXIV

Primary Category: cs.SD

Published: 2026-03-05
