When LLM Judge Scores Look Good but Best-of-N Decisions Fail

Authors

Eddie Landesberg

Abstract

Large language models are often used as judges to score candidate responses, then validated with a single global metric such as correlation with reference labels. This can be misleading when the real deployment task is best-of-n selection within a prompt.

In a 5,000-prompt best-of-4 benchmark from Chatbot Arena, a judge with moderate global correlation (r = 0.47) captures only 21.0% of the improvement that perfect selection would achieve over random choice. The gap arises because global agreement is driven largely by prompt-level baseline effects, while selection depends on within-prompt ranking: within-prompt correlation is only r_within = 0.27, and coarse pointwise scoring creates ties in 67% of pairwise comparisons.

In a matched-pair best-of-2 audit, explicit pairwise judging recovers much of this lost signal, raising recovery from 21.1% to 61.2%. For judge-based selection, the relevant audit should report within-prompt signal, tie rates, and recovery/top-1 accuracy, not global agreement alone.
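To make the proposed audit concrete, below is a minimal sketch of the three quantities the abstract says should be reported: recovery over random selection, within-prompt correlation, and the pairwise tie rate. It assumes judge scores and reference-label qualities arranged as (prompts x candidates) arrays; the function name, layout, and toy data are illustrative assumptions, not the paper's code.

    import numpy as np

    def audit_best_of_n(judge_scores, true_scores):
        """Audit a pointwise judge for best-of-n selection.

        judge_scores, true_scores: arrays of shape (n_prompts, n_candidates),
        where true_scores are reference-label qualities for each candidate.
        Returns (recovery, within-prompt correlation, pairwise tie rate).
        """
        judge = np.asarray(judge_scores, dtype=float)
        truth = np.asarray(true_scores, dtype=float)
        n_prompts, n_cand = judge.shape

        # Recovery: fraction of the oracle-over-random improvement captured
        # by picking the judge's top-scored candidate within each prompt.
        picked = truth[np.arange(n_prompts), judge.argmax(axis=1)].mean()
        random_pick = truth.mean()             # expected quality of a uniform pick
        oracle = truth.max(axis=1).mean()      # perfect within-prompt selection
        recovery = (picked - random_pick) / (oracle - random_pick)

        # Within-prompt correlation: center each prompt before pooling, which
        # strips the prompt-level baseline effects that inflate global r.
        jc = (judge - judge.mean(axis=1, keepdims=True)).ravel()
        tc = (truth - truth.mean(axis=1, keepdims=True)).ravel()
        r_within = np.corrcoef(jc, tc)[0, 1]

        # Tie rate: share of within-prompt candidate pairs the judge cannot
        # order because coarse pointwise scores coincide.
        i, j = np.triu_indices(n_cand, k=1)
        tie_rate = (judge[:, i] == judge[:, j]).mean()

        return recovery, r_within, tie_rate

    # Toy check with a coarse, noisy judge on 5,000 best-of-4 prompts
    # (synthetic data, not the Chatbot Arena benchmark).
    rng = np.random.default_rng(0)
    truth = rng.normal(size=(5000, 4))
    judge = np.round(0.5 * truth + rng.normal(size=truth.shape))  # integer-coarse scores
    print(audit_best_of_n(judge, truth))

In this layout the tie rate falls out of the coarse rounding of judge scores, and r_within is smaller than the pooled global correlation whenever prompts differ in baseline difficulty, which is exactly the gap between global agreement and selection quality that the paper measures.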

Metadata

arXiv ID: 2603.12520
Provider: ARXIV
Primary Category: cs.LG
Published: 2026-03-12
Fetched: 2026-03-16 06:01
