
Self-Distillation for Multi-Token Prediction

Authors

Guoliang Zhao, Ruobing Xie, An Wang, Shuaipeng Li, Huaibing Xie, Xingwu Sun

Abstract

As Large Language Models (LLMs) scale up, inference efficiency becomes a critical bottleneck. Multi-Token Prediction (MTP) can accelerate LLM inference by predicting multiple future tokens in parallel. However, existing MTP approaches still face two challenges: limited acceptance rates of MTP heads, and difficulties in jointly training multiple MTP heads. We therefore propose MTP-D, a simple yet effective self-distillation method with minimal additional training cost, which boosts MTP-head acceptance rates (+7.5%) while maximally preserving main-head performance. We also introduce a looped extension strategy for MTP-D, enabling effective and economical MTP-head extension and a further significant inference speedup over 1-head MTP (+220.4%). Moreover, we systematically explore and validate key insights into distillation strategies and the potential scalability of MTP through extensive experiments on seven benchmarks. These results demonstrate that our MTP-D and looped extension strategy effectively enhance MTP-head performance and inference efficiency, facilitating the practical use of MTP in LLMs.
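
Example

To make the acceptance-rate notion concrete, here is a minimal sketch of the generic draft-and-verify loop that underlies MTP-style decoding. This is not the paper's implementation: the names (verify_draft, main_next_token) are hypothetical, and it shows the simple greedy exact-match acceptance variant. The MTP heads propose several future tokens at once; the main head verifies them and keeps the longest agreeing prefix, so a higher MTP-head acceptance rate directly translates into more tokens emitted per main-model step.

from typing import Callable, List

def verify_draft(
    main_next_token: Callable[[List[int]], int],  # main head: context -> greedy next token
    draft: List[int],                             # k tokens proposed by the MTP heads
    context: List[int],
) -> List[int]:
    # Accept draft tokens left to right until the first disagreement
    # with the main head; the main head's token replaces the reject.
    # (A real decoder scores all k draft positions in one batched
    # forward pass; the per-token call here is only for clarity.)
    accepted: List[int] = []
    for tok in draft:
        target = main_next_token(context + accepted)
        if tok == target:
            accepted.append(tok)     # draft agrees: keep it, continue
        else:
            accepted.append(target)  # mismatch: emit main head's token, stop
            break
    return accepted

# Toy check with a "main head" that continues an arithmetic sequence:
# the first two draft tokens are accepted, the third is corrected.
def main_head(ctx: List[int]) -> int:
    return ctx[-1] + 1

print(verify_draft(main_head, draft=[4, 5, 9], context=[1, 2, 3]))  # -> [4, 5, 6]

Under this loop, each verification step emits at least one token and up to k tokens when every draft is accepted, which is why the acceptance-rate gain reported above compounds into a large end-to-end speedup.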

Metadata

arXiv ID: 2603.23911
Provider: ARXIV
Primary Category: cs.CL
Published: 2026-03-25
Fetched: 2026-03-26 06:02
