AI LLM March 02, 2026

Learning to Draft: Adaptive Speculative Decoding with Reinforcement Learning

Authors

Jiebin Zhang, Zhenghan Yu, Liang Wang, Nan Yang, Eugene J. Yu, Zheng Li, Yifan Song, Dawei Zhu, Xingxing Zhang, Furu Wei, Sujian Li

Abstract

Speculative decoding accelerates large language model (LLM) inference by using a small draft model to generate candidate tokens for a larger target model to verify. The efficacy of this technique hinges on the trade-off between the time spent on drafting candidates and verifying them. However, current state-of-the-art methods rely on a static time allocation, while recent dynamic approaches optimize for proxy metrics like acceptance length, often neglecting the true time cost and treating the drafting and verification phases in isolation. To address these limitations, we introduce Learning to Draft (LTD), a novel method that directly optimizes for throughput of each draft-and-verify cycle. We formulate the problem as a reinforcement learning environment and train two co-adaptive policies to dynamically coordinate the draft and verification phases. This encourages the policies to adapt to each other and explicitly maximize decoding efficiency. We conducted extensive evaluations on five diverse LLMs and four distinct tasks. Our results show that LTD achieves speedup ratios ranging from 2.24x to 4.32x, outperforming the state-of-the-art method Eagle3 by up to 36.4%.
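The trade-off described in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's method or reward function. It assumes each drafted token is accepted independently with a fixed probability, that drafting costs a fixed time per token, and that verification costs a fixed time per cycle. All names (`expected_tokens`, `cycle_throughput`, `best_draft_len`) and all parameters are hypothetical.

```python
def expected_tokens(accept_prob: float, draft_len: int) -> float:
    """Expected tokens emitted per draft-and-verify cycle.

    Assumption: each drafted token is accepted independently with
    probability accept_prob, and the target model always contributes
    one extra token after the accepted prefix.
    """
    if accept_prob >= 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_prob ** (draft_len + 1)) / (1.0 - accept_prob)


def cycle_throughput(accept_prob: float, draft_len: int,
                     t_draft: float, t_verify: float) -> float:
    """Tokens per unit time for one cycle under the toy cost model."""
    # Assumed cost model: linear drafting cost plus one verification pass.
    cycle_time = draft_len * t_draft + t_verify
    return expected_tokens(accept_prob, draft_len) / cycle_time


def best_draft_len(accept_prob: float, t_draft: float,
                   t_verify: float, max_len: int = 16) -> int:
    """Draft length in 1..max_len that maximizes toy throughput."""
    return max(
        range(1, max_len + 1),
        key=lambda k: cycle_throughput(accept_prob, k, t_draft, t_verify),
    )


if __name__ == "__main__":
    # Compare an easy context (high acceptance) with a hard one.
    for p in (0.5, 0.9):
        k = best_draft_len(p, t_draft=1.0, t_verify=10.0)
        print(f"accept_prob={p}: best draft length {k}")
```

Under these assumptions, expected acceptance length always grows with draft length, so a policy that maximizes it alone would draft as long as allowed, while the throughput-maximizing length depends on both the acceptance rate and the time costs. This is one way to see why a static allocation or a proxy metric can leave speed on the table.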

Metadata

arXiv ID: 2603.01639

Provider: ARXIV

Primary Category: cs.CL

Published: 2026-03-02

Fetched: 2026-03-03 04:34

Updated: 2026-03-02T09:17:48Z

Comment: 22 pages, 7 figures

Abstract page: https://arxiv.org/abs/2603.01639v1

PDF: https://arxiv.org/pdf/2603.01639v1