Paper
One-Eval: An Agentic System for Automated and Traceable LLM Evaluation
Authors
Chengyu Shen, Yanheng Hou, Minghui Pan, Runming He, Zhen Hao Wong, Meiyi Qiang, Zhou Liu, Hao Liang, Peichao Lai, Zeang Sheng, Wentao Zhang
Abstract
Reliable evaluation is essential for developing and deploying large language models, yet in practice it often requires substantial manual effort: practitioners must identify appropriate benchmarks, reproduce heterogeneous evaluation codebases, configure dataset schema mappings, and interpret aggregated metrics. To address these challenges, we present One-Eval, an agentic evaluation system that converts natural-language evaluation requests into executable, traceable, and customizable evaluation workflows. One-Eval integrates (i) NL2Bench for intent structuring and personalized benchmark planning, (ii) BenchResolve for benchmark resolution, automatic dataset acquisition, and schema normalization to ensure executability, and (iii) Metrics & Reporting for task-aware metric selection and decision-oriented reporting beyond scalar scores. The system further incorporates human-in-the-loop checkpoints for review, editing, and rollback, while preserving sample evidence trails for debugging and auditability. Experiments show that One-Eval can execute end-to-end evaluations from diverse natural-language requests with minimal user effort, supporting more efficient and reproducible evaluation in industrial settings. Our framework is publicly available at https://github.com/OpenDCAI/One-Eval.
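The abstract describes a three-stage pipeline (NL2Bench, then BenchResolve, then Metrics & Reporting) with human-in-the-loop checkpoints and per-sample evidence trails. The sketch below illustrates that control flow only; it is not One-Eval's actual API. All names (EvalPlan, nl2bench, bench_resolve, run_and_report) and the hard-coded benchmark mapping are hypothetical stand-ins for what the paper's components would do.

```python
# Hypothetical sketch of the pipeline the abstract describes. One-Eval's real
# interfaces are not shown on this page, so every name and mapping here is assumed.
from dataclasses import dataclass, field


@dataclass
class EvalPlan:
    """Structured intent produced from a natural-language request."""
    task: str                      # e.g. "code generation"
    benchmarks: list[str]          # e.g. ["HumanEval"]
    constraints: dict = field(default_factory=dict)


def nl2bench(request: str) -> EvalPlan:
    """Stage (i): structure the request into an editable plan.
    A real system would call an LLM here; this stub hard-codes one mapping."""
    if "code" in request.lower():
        return EvalPlan(task="code generation", benchmarks=["HumanEval"])
    return EvalPlan(task="general QA", benchmarks=["MMLU"])


def bench_resolve(plan: EvalPlan) -> list[dict]:
    """Stage (ii): resolve each benchmark to a dataset and normalize it into a
    common {prompt, reference} record schema so the run is executable."""
    records = []
    for name in plan.benchmarks:
        # Placeholder rows standing in for acquired benchmark data.
        records.append({"benchmark": name, "prompt": "...", "reference": "..."})
    return records


def run_and_report(records: list[dict], model_fn) -> dict:
    """Stage (iii): score each sample and keep a per-sample evidence trail,
    so the report is auditable rather than a single scalar."""
    trail = []
    for rec in records:
        output = model_fn(rec["prompt"])
        trail.append({**rec, "output": output, "correct": output == rec["reference"]})
    accuracy = sum(r["correct"] for r in trail) / len(trail)
    return {"accuracy": accuracy, "evidence": trail}


if __name__ == "__main__":
    plan = nl2bench("Evaluate my model on code generation")
    # Human-in-the-loop checkpoint: the plan could be reviewed or edited here
    # before execution, and rolled back if the resolved datasets look wrong.
    report = run_and_report(bench_resolve(plan), model_fn=lambda p: "...")
    print(report["accuracy"])
```

The design point worth noting is the evidence trail: by carrying every sample's prompt, output, and verdict into the report, failures can be traced back to individual examples instead of being hidden inside an aggregate score, which is the traceability property the abstract emphasizes.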
Metadata
arXiv: 2603.09821v1 (https://arxiv.org/abs/2603.09821v1)
Published: 2026-03-10
Primary category: cs.CL
PDF: https://arxiv.org/pdf/2603.09821v1
Related papers
Gen-Searcher: Reinforcing Agentic Search for Image Generation
Kaituo Feng, Manyuan Zhang, Shuang Chen, Yunlong Lin, Kaixuan Fan, Yilei Jian... • 2026-03-30
On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers
Omer Dahary, Benaya Koren, Daniel Garibi, Daniel Cohen-Or • 2026-03-30
Graphilosophy: Graph-Based Digital Humanities Computing with The Four Books
Minh-Thu Do, Quynh-Chau Le-Tran, Duc-Duy Nguyen-Mai, Thien-Trang Nguyen, Khan... • 2026-03-30
ParaSpeechCLAP: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining
Anuj Diwan, Eunsol Choi, David Harwath • 2026-03-30
RAD-AI: Rethinking Architecture Documentation for AI-Augmented Ecosystems
Oliver Aleksander Larsen, Mahyar T. Moghaddam • 2026-03-30