AI · LLM · March 13, 2026

Dependency-Aware Parallel Decoding via Attention for Diffusion LLMs

Authors

Bumjun Kim, Dongjae Jeon, Moongyu Jeon, Albert No

Abstract

Parallel decoding for diffusion LLMs (dLLMs) is difficult because each denoising step provides only token-wise marginal distributions, while unmasking multiple tokens simultaneously requires accounting for inter-token dependencies. We propose Dependency-Aware Parallel Decoding (DAPD), a simple, training-free decoding method that uses self-attention to induce a conditional dependency graph over masked tokens. At each iteration, edges in this graph capture strong token interactions, while non-edges indicate weak dependence. Parallel decoding is then reduced to selecting an independent set on the graph and unmasking the selected tokens in parallel. This avoids co-updating strongly coupled tokens without auxiliary models or retraining. Experiments on LLaDA and Dream show that DAPD improves the accuracy-steps trade-off over existing methods and enables more globally distributed parallel updates that better exploit the any-order generation capability of dLLMs.
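The core decoding loop described above can be sketched in a few lines: restrict the model's self-attention to the currently masked positions, threshold it into a dependency graph, and greedily pick an independent set of weakly coupled tokens to unmask in parallel. This is a minimal illustration under assumptions of my own — the threshold `tau`, the symmetrization, and the greedy selection order are placeholders, not the paper's exact procedure.

```python
import numpy as np

def select_independent_set(attn, masked_idx, tau=0.1):
    """Pick masked positions that can be unmasked in parallel.

    attn       : full (seq_len x seq_len) self-attention matrix
    masked_idx : list of currently masked positions
    tau        : assumed edge threshold (hypothetical hyperparameter)
    """
    # Restrict attention to masked-vs-masked positions and symmetrize,
    # since dependency is treated as an undirected relation here.
    A = attn[np.ix_(masked_idx, masked_idx)]
    W = (A + A.T) / 2
    n = len(masked_idx)

    # Edge (i, j) iff the mutual attention weight exceeds tau;
    # non-edges indicate weak dependence, so those tokens may be
    # decoded simultaneously.
    adj = (W > tau) & ~np.eye(n, dtype=bool)

    # Greedy maximal independent set: take a token, block its
    # neighbors (strongly coupled tokens), and continue.
    selected, blocked = [], np.zeros(n, dtype=bool)
    for i in range(n):
        if not blocked[i]:
            selected.append(masked_idx[i])
            blocked |= adj[i]
    return selected
```

In an actual dLLM decoding loop, `attn` would come from the model's final-layer attention at the current denoising step, and the returned positions would be unmasked together before the next iteration; a real implementation would also rank candidates (e.g. by confidence) rather than scanning positions in order.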

Metadata

arXiv ID: 2603.12996
Provider: ARXIV
Primary Category: cs.LG
Published: 2026-03-13
Fetched: 2026-03-16 06:01
