Paper
Mitigating the Bandwidth Wall via Data-Streaming System-Accelerator Co-Design
Authors
Qunyou Liu, Marina Zapater, David Atienza
Abstract
Transformers have revolutionized AI in natural language processing and computer vision, but their large computation and memory demands pose major challenges for hardware acceleration. In practice, end-to-end throughput is often limited by paged data movement and interconnect bandwidth rather than raw MAC count. This work proposes a unified system-accelerator co-design approach for transformer inference that jointly optimizes a matrix accelerator and its system integration through paged streaming dataflows and explicit overlap of compute and transfer. On the hardware side, we introduce MatrixFlow, a loosely coupled 16x16 systolic-array accelerator with a page-aligned block matrix multiplication method using 4 KB tiles, a small on-chip buffer of about 20 KB, and a pipelined schedule of DMA, compute, and DMA-out to utilize interconnect bandwidth efficiently. On the system side, we develop Gem5-AcceSys, an extension of the gem5 full-system simulator that explores standard interconnects such as PCIe and configurable memory hierarchies including Direct Memory, Direct Cache, and Device Memory modes with SMMU/TLB effects. We evaluate the co-design using gem5 simulations on representative transformer models including BERT and ViT across multiple data types and system setups. Results show up to 22x end-to-end speedup over a CPU-only baseline and 5x to 8x gains over state-of-the-art loosely and tightly coupled accelerators. We further show that a standard PCIe-based host-memory design can achieve about 80 percent of the performance of on-device HBM. Overall, paged streaming and pipeline overlap, rather than large local SRAMs, are the most effective levers for efficient transformer inference under realistic system constraints.
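The page-aligned block matrix multiplication the abstract describes can be illustrated with a minimal sketch. The tile size, data type, and function name below are assumptions for illustration (32×32 float32 tiles occupy exactly 4096 bytes, matching one 4 KB page); in the actual MatrixFlow design each tile load and store would be a DMA page transfer overlapped with systolic-array compute, which this sequential sketch does not model.

```python
import numpy as np

PAGE_BYTES = 4096
TILE = 32  # 32 * 32 float32 elements = 4096 B, i.e. one tile per 4 KB page (illustrative)

def paged_tiled_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Block matmul over page-sized tiles.

    Each inner-loop tile pair stands in for two 4 KB DMA-in transfers;
    a hardware schedule would overlap those transfers with compute on
    the previous tile pair and with DMA-out of finished result tiles.
    """
    n, k = a.shape
    k2, m = b.shape
    assert k == k2, "inner dimensions must match"
    assert n % TILE == 0 and k % TILE == 0 and m % TILE == 0, "shapes must be tile-aligned"
    c = np.zeros((n, m), dtype=a.dtype)
    for i in range(0, n, TILE):
        for j in range(0, m, TILE):
            # accumulator holds one output tile in the small on-chip buffer
            acc = np.zeros((TILE, TILE), dtype=a.dtype)
            for p in range(0, k, TILE):
                acc += a[i:i + TILE, p:p + TILE] @ b[p:p + TILE, j:j + TILE]
            c[i:i + TILE, j:j + TILE] = acc  # DMA-out of the completed tile
    return c
```

Because every operand tile is page-sized and page-aligned, each transfer maps to a whole interconnect page, which is the property that lets a small (~20 KB) on-chip buffer sustain streaming rather than relying on a large local SRAM.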
Metadata
arXiv: 2603.19057v1 • Primary category: cs.AR • Published 2026-03-19
Related papers
Vibe Coding XR: Accelerating AI + XR Prototyping with XR Blocks and Gemini
Ruofei Du, Benjamin Hersh, David Li, Nels Numan, Xun Qian, Yanhe Chen, Zhongy... • 2026-03-25
Comparing Developer and LLM Biases in Code Evaluation
Aditya Mittal, Ryan Shar, Zichu Wu, Shyam Agarwal, Tongshuang Wu, Chris Donah... • 2026-03-25
The Stochastic Gap: A Markovian Framework for Pre-Deployment Reliability and Oversight-Cost Auditing in Agentic Artificial Intelligence
Biplab Pal, Santanu Bhattacharya • 2026-03-25
Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA
Saahil Mathur, Ryan David Rittner, Vedant Ajit Thakur, Daniel Stuart Schiff, ... • 2026-03-25
MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination
Zhuo Li, Yupeng Zhang, Pengyu Cheng, Jiajun Song, Mengyu Zhou, Hao Li, Shujie... • 2026-03-25