March 04, 2026

A Rubric-Supervised Critic from Sparse Real-World Outcomes

Authors

Xingyao Wang, Valerie Chen, Heng Ji, Graham Neubig

Abstract

Academic benchmarks for coding agents tend to reward autonomous task completion, measured by verifiable rewards such as unit-test success. In contrast, real-world coding agents operate with humans in the loop, where success signals are typically noisy, delayed, and sparse. How can we bridge this gap? In this paper, we propose a process to learn a "critic" model from sparse and noisy interaction data, which can then be used as a reward model for either RL-based training or inference-time scaling. Specifically, we introduce Critic Rubrics, a rubric-based supervision framework with 24 behavioral features that can be derived from human-agent interaction traces alone. Using a semi-supervised objective, we can then jointly predict these rubrics and sparse human feedback (when present). In experiments, we demonstrate that, despite being trained primarily from trace-observable rubrics and sparse real-world outcome proxies, these critics improve best-of-N reranking on SWE-bench (Best@8 +15.9 over Random@8 on the rerankable subset of trajectories), enable early stopping (+17.7 with 83% fewer attempts), and support training-time data curation via critic-selected trajectories.
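The best-of-N reranking described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `critic_score`, the trajectory fields, and the uniform averaging over rubric features are all hypothetical stand-ins (the actual critic is a learned model over 24 trace-observable rubrics plus sparse human feedback).

```python
# Hypothetical sketch of critic-based best-of-N reranking.
# `critic_score` is a placeholder for the paper's learned critic model;
# here it simply averages per-trajectory rubric feature scores.

def critic_score(trajectory):
    # Stand-in scoring function: average of rubric feature values.
    scores = trajectory["rubric_scores"]
    return sum(scores) / len(scores)

def best_of_n(trajectories):
    # Rerank N candidate trajectories and return the top-scoring one.
    return max(trajectories, key=critic_score)

# Toy candidates with made-up rubric feature values.
candidates = [
    {"id": "a", "rubric_scores": [0.2, 0.5]},
    {"id": "b", "rubric_scores": [0.9, 0.7]},
]
print(best_of_n(candidates)["id"])  # → b
```

The same scoring function could also drive early stopping at inference time: sample trajectories one at a time and stop once `critic_score` exceeds a confidence threshold, rather than always generating all N attempts.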

Metadata

arXiv ID: 2603.03800
Provider: ARXIV
Primary Category: cs.AI
Published: 2026-03-04
Fetched: 2026-03-05 06:06
