AI LLM March 06, 2026

The World Won't Stay Still: Programmable Evolution for Agent Benchmarks

Authors

Guangrui Li, Yaochen Xie, Yi Liu, Ziwei Dong, Xingyuan Pan, Tianqi Zheng, Jason Choi, Michael J. Morais, Binit Jha, Shaunak Mishra, Bingrou Zhou, Chen Luo, Monica Xiao Cheng, Dawn Song

Abstract

LLM-powered agents fulfill user requests by interacting with environments, querying data, and invoking tools in a multi-turn process. Yet, most existing benchmarks assume static environments with fixed schemas and toolsets, neglecting the evolutionary nature of the real world and agents' robustness to environmental changes. In this paper, we study a crucial problem: how to evolve the agent environment in a scalable and controllable way, thereby better evaluating agents' adaptability to real-world dynamics. We propose ProEvolve, a graph-based framework that makes environment evolution programmable. At its core, a typed relational graph provides a unified, explicit representation of the environment: data, tools, and schema. Under this formalism, adding, removing, or modifying capabilities are expressed as graph transformations that coherently propagate updates across tools, schemas, and data access. Building on this, ProEvolve can (1) program the evolutionary dynamics as graph transformations to generate environments automatically, and (2) instantiate task sandboxes via subgraph sampling and programming. We validate ProEvolve by evolving a single environment into 200 environments and 3,000 task sandboxes, and benchmark representative agents accordingly.
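To make the graph formalism concrete, here is a minimal sketch of a typed relational graph in which removing a capability is a single transformation that propagates to dependent tools, followed by a simple sandbox sampler. This is not ProEvolve's implementation or API. The class `EnvGraph`, the functions `drop_column`, `sample_sandbox`, and `build_demo`, the node types (`table`, `column`, `tool`), the relation names (`has_column`, `reads`), and the toy `orders` schema are all hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class EnvGraph:
    """Hypothetical typed relational graph: typed nodes plus typed edges."""
    nodes: dict = field(default_factory=dict)  # node name -> node type
    edges: set = field(default_factory=set)    # (source, relation, target)

    def add_node(self, name, kind):
        self.nodes[name] = kind

    def add_edge(self, src, rel, dst):
        self.edges.add((src, rel, dst))

    def remove_node(self, name):
        # Removing a node also removes every edge that touches it.
        self.nodes.pop(name, None)
        self.edges = {e for e in self.edges if name not in (e[0], e[2])}


def drop_column(g, column):
    """Transformation: remove a column and every tool that reads it."""
    dependents = sorted(s for (s, r, d) in g.edges if r == "reads" and d == column)
    g.remove_node(column)
    for tool in dependents:
        g.remove_node(tool)
    return dependents


def sample_sandbox(g, tool_count, seed=0):
    """Sample tools, then keep the subgraph of columns and tables they need."""
    rng = random.Random(seed)
    tools = sorted(n for n, k in g.nodes.items() if k == "tool")
    chosen = set(rng.sample(tools, min(tool_count, len(tools))))
    keep = set(chosen)
    # Columns that the chosen tools read.
    keep |= {d for (s, r, d) in g.edges if s in chosen}
    # Tables that own those columns.
    keep |= {s for (s, r, d) in g.edges if r == "has_column" and d in keep}
    sub = EnvGraph()
    for name in keep:
        sub.add_node(name, g.nodes[name])
    for (s, r, d) in g.edges:
        if s in keep and d in keep:
            sub.add_edge(s, r, d)
    return sub


def build_demo():
    """Toy environment (assumed for illustration): one table, two tools."""
    g = EnvGraph()
    g.add_node("orders", "table")
    for col in ("orders.status", "orders.total"):
        g.add_node(col, "column")
        g.add_edge("orders", "has_column", col)
    g.add_node("get_order_status", "tool")
    g.add_edge("get_order_status", "reads", "orders.status")
    g.add_node("get_order_total", "tool")
    g.add_edge("get_order_total", "reads", "orders.total")
    return g
```

Calling `drop_column(build_demo(), "orders.status")` removes the column together with the one tool that depended on it, so the schema and the toolset stay consistent without separate bookkeeping. In this sketch a sandbox is just the subgraph reachable from a sampled set of tools; the paper's sampling and programming procedure may differ.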

Metadata

arXiv ID: 2603.05910

Provider: ARXIV

Primary Category: cs.AI

Published: 2026-03-06


