Paper
DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference
Authors
Yongtong Wu, Shaoyuan Chen, Yinmin Zhong, Rilin Huang, Yixuan Tan, Wentao Zhang, Liyue Zhang, Shangyan Zhou, Yuxuan Liu, Shunfeng Zhou, Mingxing Zhang, Xin Jin, Panpan Huang
Abstract
The performance of multi-turn, agentic LLM inference is increasingly dominated by KV-Cache storage I/O rather than computation. In prevalent disaggregated architectures, loading the massive KV-Cache from external storage creates a fundamental imbalance: storage NICs on prefill engines become bandwidth-saturated, while those on decoding engines remain idle. This asymmetry severely constrains overall system throughput. We present DualPath, an inference system that breaks this bottleneck by introducing dual-path KV-Cache loading. Beyond the traditional storage-to-prefill path, DualPath enables a novel storage-to-decode path, in which the KV-Cache is loaded into decoding engines and then efficiently transferred to prefill engines via RDMA over the compute network. DualPath combines this optimized data path -- which inherently avoids network congestion and avoids interference with latency-critical model execution communications -- with a global scheduler that dynamically balances load across prefill and decode engines. Our evaluation on three models with production agentic workloads demonstrates that DualPath improves offline inference throughput by up to 1.87$\times$ on our in-house inference system. It can also improve online serving throughput by an average factor of 1.96$\times$ without violating SLO.
Metadata
Related papers
Vibe Coding XR: Accelerating AI + XR Prototyping with XR Blocks and Gemini
Ruofei Du, Benjamin Hersh, David Li, Nels Numan, Xun Qian, Yanhe Chen, Zhongy... • 2026-03-25
Comparing Developer and LLM Biases in Code Evaluation
Aditya Mittal, Ryan Shar, Zichu Wu, Shyam Agarwal, Tongshuang Wu, Chris Donah... • 2026-03-25
The Stochastic Gap: A Markovian Framework for Pre-Deployment Reliability and Oversight-Cost Auditing in Agentic Artificial Intelligence
Biplab Pal, Santanu Bhattacharya • 2026-03-25
Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA
Saahil Mathur, Ryan David Rittner, Vedant Ajit Thakur, Daniel Stuart Schiff, ... • 2026-03-25
MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination
Zhuo Li, Yupeng Zhang, Pengyu Cheng, Jiajun Song, Mengyu Zhou, Hao Li, Shujie... • 2026-03-25
Raw Data (Debug)
{
"raw_xml": "<entry>\n <id>http://arxiv.org/abs/2602.21548v1</id>\n <title>DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference</title>\n <updated>2026-02-25T04:10:58Z</updated>\n <link href='https://arxiv.org/abs/2602.21548v1' rel='alternate' type='text/html'/>\n <link href='https://arxiv.org/pdf/2602.21548v1' rel='related' title='pdf' type='application/pdf'/>\n <summary>The performance of multi-turn, agentic LLM inference is increasingly dominated by KV-Cache storage I/O rather than computation. In prevalent disaggregated architectures, loading the massive KV-Cache from external storage creates a fundamental imbalance: storage NICs on prefill engines become bandwidth-saturated, while those on decoding engines remain idle. This asymmetry severely constrains overall system throughput.\n We present DualPath, an inference system that breaks this bottleneck by introducing dual-path KV-Cache loading. Beyond the traditional storage-to-prefill path, DualPath enables a novel storage-to-decode path, in which the KV-Cache is loaded into decoding engines and then efficiently transferred to prefill engines via RDMA over the compute network. DualPath combines this optimized data path -- which inherently avoids network congestion and avoids interference with latency-critical model execution communications -- with a global scheduler that dynamically balances load across prefill and decode engines.\n Our evaluation on three models with production agentic workloads demonstrates that DualPath improves offline inference throughput by up to 1.87$\\times$ on our in-house inference system. It can also improve online serving throughput by an average factor of 1.96$\\times$ without violating SLO.</summary>\n <category scheme='http://arxiv.org/schemas/atom' term='cs.DC'/>\n <published>2026-02-25T04:10:58Z</published>\n <arxiv:primary_category term='cs.DC'/>\n <author>\n <name>Yongtong Wu</name>\n </author>\n <author>\n <name>Shaoyuan Chen</name>\n </author>\n <author>\n <name>Yinmin Zhong</name>\n </author>\n <author>\n <name>Rilin Huang</name>\n </author>\n <author>\n <name>Yixuan Tan</name>\n </author>\n <author>\n <name>Wentao Zhang</name>\n </author>\n <author>\n <name>Liyue Zhang</name>\n </author>\n <author>\n <name>Shangyan Zhou</name>\n </author>\n <author>\n <name>Yuxuan Liu</name>\n </author>\n <author>\n <name>Shunfeng Zhou</name>\n </author>\n <author>\n <name>Mingxing Zhang</name>\n </author>\n <author>\n <name>Xin Jin</name>\n </author>\n <author>\n <name>Panpan Huang</name>\n </author>\n </entry>"
}