PCR: A Prefetch-Enhanced Cache Reuse System for Low-Latency RAG Serving

Authors

Wenfeng Wang, Xiaofeng Hou, Peng Tang, Hengyi Zhou, Jing Wang, Xinkai Wang, Chao Li, Minyi Guo

Abstract

Retrieval-Augmented Generation (RAG) systems enhance the performance of large language models (LLMs) by incorporating supplementary retrieved documents, enabling more accurate and context-aware responses. However, integrating these external documents often results in very long input sequences, which significantly increases computation costs during the prefill stage, where key-value (KV) representations for all input tokens are generated. This latency bottleneck becomes especially pronounced under high-throughput serving scenarios. KV-cache reuse offers a promising solution by storing previously computed KV states for shared input prefixes, thereby avoiding redundant computation across requests that contain overlapping context. Yet, the effectiveness of cache reuse is often limited by three practical challenges: low cache hit rates due to naive eviction policies, high CPU-GPU data transfer overhead, and slow SSD I/O when caches spill to storage. To address these issues, we propose PCR, a system designed to maximize KV-cache reuse efficiency through intelligent prefetching and pipelined data movement. Specifically, PCR introduces three key techniques: (1) a prefix-tree caching structure with a look-ahead LRU replacement policy that uses pending requests in the scheduler queue to improve cache hit ratios; (2) layer-wise overlapping that pipelines KV-cache loading and GPU computation across CUDA streams to hide communication latency; and (3) queue-based prefetching that proactively loads relevant KV caches from SSD into DRAM before they are needed. Extensive experiments show that PCR outperforms existing KV-cache reuse methods, achieving up to a 2.47x speedup in average time-to-first-token (TTFT).
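To make the first technique concrete: a look-ahead LRU policy differs from plain LRU only in its eviction step, which consults the scheduler's pending queue and protects cached prefixes that an upcoming request will reuse. The sketch below is a minimal illustration of that idea, not the paper's implementation; all names (`LookAheadLRUCache`, `pending_prefixes`) are hypothetical, and a flat dict keyed by prefix stands in for the paper's prefix-tree structure.

```python
from collections import OrderedDict

class LookAheadLRUCache:
    """Minimal sketch of a look-ahead LRU cache for KV-cache blobs.

    Entries are keyed by input prefix. On eviction, prefixes that appear
    in the scheduler's pending-request queue are skipped, so a cache entry
    about to be reused is not thrown away moments before its hit.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()  # prefix -> KV blob; most recent at the end

    def get(self, prefix):
        if prefix in self.cache:
            self.cache.move_to_end(prefix)  # refresh recency on a hit
            return self.cache[prefix]
        return None

    def put(self, prefix, kv, pending_prefixes=()):
        if prefix in self.cache:
            self.cache.move_to_end(prefix)
        self.cache[prefix] = kv
        while len(self.cache) > self.capacity:
            self._evict_one(set(pending_prefixes))

    def _evict_one(self, pending):
        # Scan from least- to most-recently used; evict the first entry
        # that no queued request needs. If every entry is protected,
        # fall back to plain LRU.
        for key in self.cache:  # OrderedDict iterates oldest first
            if key not in pending:
                del self.cache[key]
                return
        self.cache.popitem(last=False)
```

Under this policy, a prefix that is least-recently used but referenced by a queued request survives eviction, which is exactly how pending requests raise the hit ratio relative to naive LRU.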

Metadata

arXiv ID: 2603.23049
Provider: ARXIV
Primary Category: cs.DC
Published: 2026-03-24
Fetched: 2026-03-25 06:02
