Research

Paper

AI LLM March 13, 2026

Serving Hybrid LLM Loads with SLO Guarantees Using CPU-GPU Attention Piggybacking

Authors

Zizhao Mo, Junlin Chen, Huanle Xu, Chengzhong Xu

Abstract

Nowadays, service providers often deploy multiple types of LLM services within shared clusters. While the service colocation improves resource utilization, it introduces significant interference risks for latency-sensitive (LS) services-which have strict SLO requirements for inference latency-and severely constrain the service capacity of best-effort (BE) services due to limited available memory. To address interference, existing systems typically rely on reserving headroom to constrain BE resource usage. However, this approach's coarse granularity compromises the SLO compliance of the latency-sensitive service and unnecessarily restricts the generation potential of the best effort service. In this paper, we propose OmniServe, a novel LLM serving system that efficiently harnesses both CPU and GPU resources to mitigate interference and improve throughput. Central to OmniServe is the Attention Piggybacking mechanism, which effectively offloads the Attention computation of BE services to CPUs on the fly. This mechanism also facilitates asynchronous communication between CPU and GPU streams, preventing GPUs from being blocked while aggregating Attention results. Additionally, OmniServe incorporates a dynamic batching control policy to adapt to fluctuating request arrivals, facilitating Dense module computation using layer-wise batching. Experimental results show that OmniServe improves the SLO attainment rate for LS services by up to $1.48\times$ while enhancing BE serving throughput by up to $9.85\times$ compared to state-of-the-art systems.

Metadata

arXiv ID: 2603.12831

Provider: ARXIV

Primary Category: cs.DC

Published: 2026-03-13

Fetched: 2026-03-16 06:01

Related papers

Vibe Coding XR: Accelerating AI + XR Prototyping with XR Blocks and Gemini

Ruofei Du, Benjamin Hersh, David Li, Nels Numan, Xun Qian, Yanhe Chen, Zhongy... • 2026-03-25

Comparing Developer and LLM Biases in Code Evaluation

Aditya Mittal, Ryan Shar, Zichu Wu, Shyam Agarwal, Tongshuang Wu, Chris Donah... • 2026-03-25

The Stochastic Gap: A Markovian Framework for Pre-Deployment Reliability and Oversight-Cost Auditing in Agentic Artificial Intelligence

Biplab Pal, Santanu Bhattacharya • 2026-03-25

Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA

Saahil Mathur, Ryan David Rittner, Vedant Ajit Thakur, Daniel Stuart Schiff, ... • 2026-03-25

MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination

Zhuo Li, Yupeng Zhang, Pengyu Cheng, Jiajun Song, Mengyu Zhou, Hao Li, Shujie... • 2026-03-25

Raw Data (Debug)

{
  "raw_xml": "<entry>\n    <id>http://arxiv.org/abs/2603.12831v1</id>\n    <title>Serving Hybrid LLM Loads with SLO Guarantees Using CPU-GPU Attention Piggybacking</title>\n    <updated>2026-03-13T09:32:56Z</updated>\n    <link href='https://arxiv.org/abs/2603.12831v1' rel='alternate' type='text/html'/>\n    <link href='https://arxiv.org/pdf/2603.12831v1' rel='related' title='pdf' type='application/pdf'/>\n    <summary>Nowadays, service providers often deploy multiple types of LLM services within shared clusters. While the service colocation improves resource utilization, it introduces significant interference risks for latency-sensitive (LS) services-which have strict SLO requirements for inference latency-and severely constrain the service capacity of best-effort (BE) services due to limited available memory. To address interference, existing systems typically rely on reserving headroom to constrain BE resource usage. However, this approach's coarse granularity compromises the SLO compliance of the latency-sensitive service and unnecessarily restricts the generation potential of the best effort service.\n  In this paper, we propose OmniServe, a novel LLM serving system that efficiently harnesses both CPU and GPU resources to mitigate interference and improve throughput. Central to OmniServe is the Attention Piggybacking mechanism, which effectively offloads the Attention computation of BE services to CPUs on the fly. This mechanism also facilitates asynchronous communication between CPU and GPU streams, preventing GPUs from being blocked while aggregating Attention results. Additionally, OmniServe incorporates a dynamic batching control policy to adapt to fluctuating request arrivals, facilitating Dense module computation using layer-wise batching. Experimental results show that OmniServe improves the SLO attainment rate for LS services by up to $1.48\\times$ while enhancing BE serving throughput by up to $9.85\\times$ compared to state-of-the-art systems.</summary>\n    <category scheme='http://arxiv.org/schemas/atom' term='cs.DC'/>\n    <published>2026-03-13T09:32:56Z</published>\n    <arxiv:primary_category term='cs.DC'/>\n    <author>\n      <name>Zizhao Mo</name>\n    </author>\n    <author>\n      <name>Junlin Chen</name>\n    </author>\n    <author>\n      <name>Huanle Xu</name>\n    </author>\n    <author>\n      <name>Chengzhong Xu</name>\n    </author>\n    <arxiv:doi>10.1145/3802107</arxiv:doi>\n    <link href='https://doi.org/10.1145/3802107' rel='related' title='doi'/>\n  </entry>"
}