AI LLM March 02, 2026

HeRo: Adaptive Orchestration of Agentic RAG on Heterogeneous Mobile SoC

Authors

Maoliang Li, Jiayu Chen, Zihao Zheng, Ziqian Li, Xinhao Sun, Guojie Luo, Chenchen Liu, Xiang Chen

Abstract

As the computational capability of mobile devices grows, deploying agentic retrieval-augmented generation (RAG) locally on heterogeneous systems-on-chip (SoCs) has become a promising way to enhance LLM-based applications. However, agentic RAG induces multi-stage workflows with heterogeneous models and dynamic execution flow, while mobile SoCs exhibit strong accelerator affinity, shape sensitivity, and shared-memory bandwidth contention, which makes naive scheduling ineffective. We present HeRo, a heterogeneity-aware framework for low-latency agentic RAG on mobile SoCs. HeRo builds profiling-based performance models for each sub-stage and each model-PU (processing unit) configuration, capturing latency, workload shape, and contention-induced slowdown. A lightweight online scheduler then uses these models to combine shape-aware sub-stage partitioning, criticality-based accelerator mapping, and bandwidth-aware concurrency control. Experiments on commercial mobile devices show that HeRo reduces end-to-end latency by up to $10.94\times$ over existing deployment strategies, enabling practical on-device agentic RAG.
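The abstract does not spell out the scheduling algorithm, so the following is only a minimal sketch of how two of the named ideas, criticality-based mapping and bandwidth-aware concurrency control, could fit together. It is not HeRo's implementation. The profiling table, the stage and PU names, the numbers, and the `schedule` function are all hypothetical, and shape-aware partitioning is left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    latency_ms: float      # isolated latency of the sub-stage on this PU
    bandwidth_gbps: float  # memory bandwidth drawn while it runs


# Hypothetical profiling table: (sub_stage, pu) -> Profile.
# All values are invented for illustration, not measured.
PROFILES = {
    ("embed", "NPU"): Profile(8.0, 6.0),
    ("embed", "GPU"): Profile(12.0, 9.0),
    ("rerank", "NPU"): Profile(20.0, 10.0),
    ("rerank", "GPU"): Profile(15.0, 14.0),
    ("decode", "GPU"): Profile(40.0, 18.0),
    ("decode", "CPU"): Profile(90.0, 12.0),
}


def schedule(ready, criticality, bandwidth_budget_gbps):
    """Greedily map ready sub-stages to free PUs, most critical first.

    A sub-stage is deferred (left out of the plan) when no free PU can
    run it without pushing total bandwidth demand over the budget.
    """
    busy = set()
    used = 0.0
    plan = {}
    for stage in sorted(ready, key=lambda s: criticality[s], reverse=True):
        options = [
            (prof.latency_ms, pu, prof)
            for (s, pu), prof in PROFILES.items()
            if s == stage
            and pu not in busy
            and used + prof.bandwidth_gbps <= bandwidth_budget_gbps
        ]
        if not options:
            continue  # defer this sub-stage to a later scheduling round
        _, pu, prof = min(options, key=lambda o: o[0])  # fastest feasible PU
        plan[stage] = pu
        busy.add(pu)
        used += prof.bandwidth_gbps
    return plan


if __name__ == "__main__":
    crit = {"decode": 3, "rerank": 2, "embed": 1}
    print(schedule(["embed", "rerank", "decode"], crit, 30.0))
    print(schedule(["embed", "rerank", "decode"], crit, 25.0))
```

With a 30 GB/s budget the sketch co-runs `decode` and `rerank`; with a 25 GB/s budget it defers `rerank` and co-runs the lighter `embed` instead, which is the kind of trade-off a bandwidth-aware scheduler has to make.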

Metadata

arXiv ID: 2603.01661
Provider: ARXIV
Primary Category: cs.DC
Published: 2026-03-02
Fetched: 2026-03-03 04:34

Comment: Will appear in DAC'2026