Paper
HyGra: Accelerating Network-State Simulation for LLM Training in DCNs via Adaptive Packet-Flow Granularity
Authors
Wenyi Wang, Zheng Wu, Yanmeng Wang, Haolin Mao, Lei Han, Gaogang Xie, Fu Xiao
Abstract
In recent years, large language models (LLMs) have driven substantial intelligent transformation across diverse industries. Commercial LLM training is typically performed over data center networks (DCNs) comprising hundreds to thousands of GPUs, with multiple devices collocated per node. As network scale expands, inter-node communication becomes a primary bottleneck to training efficiency. Network-state simulators therefore play a crucial role by enabling cost-effective evaluation of network configurations and parallelization strategies through faithful emulation of DCN dynamics during LLM training. However, existing simulators are constrained by an efficiency-fidelity tradeoff: packet-level simulators (PLSs) incur prohibitive runtime overhead, whereas flow-level simulators (FLSs) compromise essential modeling accuracy. In this paper, we develop \texttt{HyGra}, a hybrid-granularity network-state simulator that exploits intrinsic network dynamics in LLM training to adaptively switch simulation granularity. Specifically, \texttt{HyGra} employs packet-level simulation during non-steady phases with transient fluctuations and flow-level simulation during steady phases with periodic patterns, thereby accelerating execution while preserving high fidelity. Moreover, it requires no specialized hardware, supports single-machine deployment, and is compatible with existing simulators. Experiments on representative commercial LLM workloads, including ChatGPT, DeepSeek, and Qwen, show that \texttt{HyGra} achieves up to 15.4$\times$ speedup under a single parallelization strategy and 7.8$\times$ under hybrid parallelization strategies while maintaining high accuracy.
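The abstract states the switching policy only at a high level (packet-level during non-steady phases, flow-level during steady phases) and does not give the detection criterion. As one way to picture the idea, below is a minimal Python sketch that labels a traffic trace with a simulation granularity using a simple relative-variance steadiness test over a sliding window; the names, field types, window size, and threshold are all illustrative assumptions, not HyGra's actual interface or method.

```python
# Illustrative sketch only -- NOT HyGra's implementation. The steadiness
# test (relative standard deviation over a sliding window) and every name
# below (TrafficSample, is_steady, choose_granularity) are hypothetical.
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List

@dataclass
class TrafficSample:
    time_us: float          # sample timestamp (microseconds)
    throughput_gbps: float  # observed link throughput at that instant

def is_steady(window: List[float], rel_std_threshold: float = 0.05) -> bool:
    """Treat a window as steady when throughput varies little relative to
    its mean, standing in for the periodic steady-phase patterns the
    abstract attributes to LLM training traffic."""
    mu = mean(window)
    return mu > 0 and pstdev(window) / mu < rel_std_threshold

def choose_granularity(samples: List[TrafficSample], window: int = 32) -> List[str]:
    """Assign each sample the granularity an adaptive simulator might use:
    cheap flow-level steps in steady phases, expensive but faithful
    packet-level steps during transient fluctuations."""
    modes: List[str] = []
    recent: List[float] = []
    for s in samples:
        recent.append(s.throughput_gbps)
        recent = recent[-window:]  # keep only the sliding window
        steady = len(recent) == window and is_steady(recent)
        modes.append("flow" if steady else "packet")
    return modes
```

Under this toy policy, ramp-up and bursty stretches of a trace fall back to packet-level simulation while long uniform stretches run at flow level, which matches the abstract's claim that flow-level simulation of steady phases is what accelerates execution while packet-level handling of transients preserves fidelity.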
Metadata
arXiv ID: 2603.12671v1
Primary category: cs.NI
Published: 2026-03-13
Comments: 14 pages, 7 figures and 5 tables
Abstract page: https://arxiv.org/abs/2603.12671v1
PDF: https://arxiv.org/pdf/2603.12671v1
Related papers
Vibe Coding XR: Accelerating AI + XR Prototyping with XR Blocks and Gemini
Ruofei Du, Benjamin Hersh, David Li, Nels Numan, Xun Qian, Yanhe Chen, Zhongy... • 2026-03-25
Comparing Developer and LLM Biases in Code Evaluation
Aditya Mittal, Ryan Shar, Zichu Wu, Shyam Agarwal, Tongshuang Wu, Chris Donah... • 2026-03-25
The Stochastic Gap: A Markovian Framework for Pre-Deployment Reliability and Oversight-Cost Auditing in Agentic Artificial Intelligence
Biplab Pal, Santanu Bhattacharya • 2026-03-25
Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA
Saahil Mathur, Ryan David Rittner, Vedant Ajit Thakur, Daniel Stuart Schiff, ... • 2026-03-25
MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination
Zhuo Li, Yupeng Zhang, Pengyu Cheng, Jiajun Song, Mengyu Zhou, Hao Li, Shujie... • 2026-03-25