
Lagom: Unleashing the Power of Communication and Computation Overlapping for Distributed LLM Training

Authors

Guanbin Xu, ZhenGuo Xu, Yuzhe Li, Youhui Bai, Ping Gong, Chaoyi Ruan, Cheng Li

Abstract

Overlapping communication with computation is crucial for distributed large-model training, yet optimizing it remains challenging, especially when computation becomes the bottleneck. We present Lagom, a system that co-tunes communication parameters to balance resource usage between computation and communication. By introducing a unified cost model and a priority-based search algorithm, Lagom reduces optimization complexity from exponential to linear. Evaluations on high- and low-bandwidth GPU clusters show that Lagom achieves 1.07-1.33x and 1.03-1.27x speedups over NCCL and AutoCCL, respectively, across diverse models and parallelization strategies.
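
Illustrative sketch: priority-based parameter search

The abstract's core idea, tuning each communication knob independently in a fixed priority order against a shared cost model rather than searching the full cross-product, can be sketched in Python as below. This is a minimal illustration under assumed names and numbers; the tunables, the priority order, and the cost function are all hypothetical stand-ins, not Lagom's actual implementation.

from itertools import product

# Hypothetical tunables an NCCL-like library might expose; neither the
# names nor the values come from the Lagom paper.
SEARCH_SPACE = {
    "num_channels": [1, 2, 4, 8],          # more channels occupy more SMs
    "chunk_size_kb": [64, 128, 256, 512],
    "protocol": ["LL", "LL128", "Simple"],
}

# Assumed priority order: tune the highest-impact knob first.
PRIORITY = ["num_channels", "chunk_size_kb", "protocol"]

def iteration_cost(cfg):
    """Stand-in unified cost model for one overlapped training step.

    With overlap, step time is bounded by the slower of the compute and
    communication streams, but communication channels steal SMs and so
    slow computation down; the model captures that tension.
    """
    comm_ms = 10.0 / cfg["num_channels"] + 0.002 * cfg["chunk_size_kb"]
    comm_ms *= {"LL": 1.3, "LL128": 1.1, "Simple": 1.0}[cfg["protocol"]]
    compute_ms = 8.0 * (1.0 + 0.05 * cfg["num_channels"])  # SM contention
    return max(compute_ms, comm_ms)

def priority_search(space, priority, cost):
    """Greedy one-knob-at-a-time tuning: cost-model calls grow with the
    sum of option counts, not their product."""
    cfg = {knob: options[0] for knob, options in space.items()}
    for knob in priority:
        cfg[knob] = min(space[knob], key=lambda v: cost({**cfg, knob: v}))
    return cfg

def exhaustive_search(space, cost):
    """Baseline cross-product search: exponential in the number of knobs."""
    knobs = list(space)
    return min((dict(zip(knobs, vals)) for vals in product(*space.values())),
               key=cost)

if __name__ == "__main__":
    greedy = priority_search(SEARCH_SPACE, PRIORITY, iteration_cost)
    exact = exhaustive_search(SEARCH_SPACE, iteration_cost)
    print("greedy:", greedy, f"-> {iteration_cost(greedy):.2f} ms/step")
    print("exact :", exact, f"-> {iteration_cost(exact):.2f} ms/step")

Because the greedy pass evaluates each knob's options once, the number of cost-model calls grows with the sum of option counts (4 + 4 + 3 = 11 here) rather than their product (4 x 4 x 3 = 48), which illustrates the exponential-to-linear reduction the abstract refers to.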

Metadata

arXiv ID: 2602.20656
Provider: ARXIV
Primary Category: cs.DC
Published: 2026-02-24
Comment: 6 pages, 8 figures
Abstract page: https://arxiv.org/abs/2602.20656v1
PDF: https://arxiv.org/pdf/2602.20656v1
Fetched: 2026-02-25 06:05
