Paper
CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation
Authors
Weinan Dai, Hanlin Wu, Qiying Yu, Huan-ang Gao, Jiahao Li, Chengquan Jiang, Weiqiang Lou, Yufan Song, Hongli Yu, Jiaze Chen, Wei-Ying Ma, Ya-Qin Zhang, Jingjing Liu, Mingxuan Wang, Xin Liu, Hao Zhou
Abstract
GPU kernel optimization is fundamental to modern deep learning but remains a highly specialized task requiring deep hardware expertise. Despite strong performance in general programming, large language models (LLMs) remain uncompetitive with compiler-based systems such as torch.compile for CUDA kernel generation. Existing CUDA code generation approaches either rely on training-free refinement or fine-tune models within fixed multi-turn execution-feedback loops; both paradigms fail to fundamentally improve the model's intrinsic CUDA optimization ability, yielding limited performance gains. We present CUDA Agent, a large-scale agentic reinforcement learning system that develops CUDA kernel expertise through three components: a scalable data synthesis pipeline; a skill-augmented CUDA development environment with automated verification and profiling that provides reliable reward signals; and reinforcement learning algorithmic techniques that enable stable training. CUDA Agent achieves state-of-the-art results on KernelBench, producing kernels faster than torch.compile at rates of 100%, 100%, and 92% on the Level-1, Level-2, and Level-3 splits respectively, and outperforming the strongest proprietary models such as Claude Opus 4.5 and Gemini 3 Pro by about 40% on the hardest Level-3 setting.
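The reward signal the abstract describes (automated verification plus profiling) can be sketched as a minimal illustration. The shape below, correctness-gated speedup over a reference implementation, is an assumption for illustration only; `profile`, `kernel_reward`, and the zero-argument callables are hypothetical names, not the paper's actual environment API.

```python
import statistics
import time

def profile(fn, warmup=3, iters=10):
    """Median wall-clock time of fn() over several runs, after warmup."""
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return statistics.median(times)

def kernel_reward(candidate, reference, check_correct):
    """Hypothetical reward: speedup over the reference if the candidate's
    output verifies as correct, else 0 (covering both compile/runtime
    failures and wrong results)."""
    try:
        out = candidate()
    except Exception:
        return 0.0  # build or runtime failure -> no reward
    if not check_correct(out, reference()):
        return 0.0  # incorrect result -> no reward
    # Reward = reference time / candidate time (speedup ratio).
    return profile(reference) / profile(candidate)
```

A reward of this shape gives the RL trainer a dense, strictly verified signal: any kernel that fails verification scores zero, and among correct kernels the score grows with measured speedup.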
Metadata
arXiv: 2602.24286v1 • Published: 2026-02-27 • Categories: cs.LG (primary), cs.AI
Related papers
Vibe Coding XR: Accelerating AI + XR Prototyping with XR Blocks and Gemini
Ruofei Du, Benjamin Hersh, David Li, Nels Numan, Xun Qian, Yanhe Chen, Zhongy... • 2026-03-25
Comparing Developer and LLM Biases in Code Evaluation
Aditya Mittal, Ryan Shar, Zichu Wu, Shyam Agarwal, Tongshuang Wu, Chris Donah... • 2026-03-25
The Stochastic Gap: A Markovian Framework for Pre-Deployment Reliability and Oversight-Cost Auditing in Agentic Artificial Intelligence
Biplab Pal, Santanu Bhattacharya • 2026-03-25
Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA
Saahil Mathur, Ryan David Rittner, Vedant Ajit Thakur, Daniel Stuart Schiff, ... • 2026-03-25
MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination
Zhuo Li, Yupeng Zhang, Pengyu Cheng, Jiajun Song, Mengyu Zhou, Hao Li, Shujie... • 2026-03-25