AI LLM February 24, 2026

Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training

Authors

Anas Barakat, Souradip Chakraborty, Khushbu Pahwa, Amrit Singh Bedi

Abstract

Pass@$k$ is a widely used performance metric for verifiable large language model tasks, including mathematical reasoning, code generation, and short-answer reasoning. It defines success if any of $k$ independently sampled solutions passes a verifier. This multi-sample inference metric has motivated inference-aware fine-tuning methods that directly optimize pass@$k$. However, prior work reports a recurring trade-off: pass@$k$ improves while pass@1 degrades under such methods. This trade-off is practically important because pass@1 often remains a hard operational constraint due to latency and cost budgets, imperfect verifier coverage, and the need for a reliable single-shot fallback. We study the origin of this trade-off and provide a theoretical characterization of when pass@$k$ policy optimization can reduce pass@1 through gradient conflict induced by prompt interference. We show that pass@$k$ policy gradients can conflict with pass@1 gradients because pass@$k$ optimization implicitly reweights prompts toward low-success prompts; when these prompts are what we term negatively interfering, their upweighting can rotate the pass@$k$ update direction away from the pass@1 direction. We illustrate our theoretical findings with large language model experiments on verifiable mathematical reasoning tasks.
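For readers unfamiliar with the metric, pass@$k$ is commonly estimated from $n \ge k$ generated samples per prompt, $c$ of which pass the verifier, via the unbiased estimator $1 - \binom{n-c}{k}/\binom{n,k}$ (this is the standard estimator from the code-generation literature, not something specific to this paper). A minimal sketch:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k: the probability that at least one
    of k solutions drawn without replacement from n generated samples
    (c of which are correct) passes the verifier.
    """
    if n - c < k:
        # Fewer than k incorrect samples, so any k-subset contains a correct one.
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable running product.
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# With 10 samples and 3 correct: pass@1 = 0.3, while pass@5 is much higher.
print(pass_at_k(10, 3, 1))  # 0.3
print(pass_at_k(10, 3, 5))  # ~0.917
```

Note how the same prompt contributes very differently to pass@1 and pass@$k$ for large $k$, which is the source of the implicit reweighting the abstract describes.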
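The gradient-conflict mechanism can be illustrated with a two-prompt toy example (our own construction, not the paper's experiment). For a prompt with single-sample success rate $p$, the per-prompt pass@$k$ objective is $1 - (1-p)^k$, so by the chain rule its gradient carries the weight $k(1-p)^{k-1}$, which grows with $k$ for low-success prompts. If a hard prompt's gradient points partly against an easy prompt's, upweighting it can flip the sign of the aggregate update's projection onto the pass@1 direction:

```python
import numpy as np

def aggregate_update(grads, succ, k):
    """Sum per-prompt gradients with the pass@k chain-rule weight
    k * (1 - p)**(k - 1), i.e. d/dp [1 - (1 - p)**k]."""
    w = k * (1.0 - succ) ** (k - 1)
    return (w[:, None] * grads).sum(axis=0)

# Toy setup: an easy prompt (p = 0.9) and a hard one (p = 0.1) whose
# gradient negatively interferes with the easy prompt's.
grads = np.array([[1.0, 0.0],
                  [-0.8, 0.2]])
succ = np.array([0.9, 0.1])

u1 = aggregate_update(grads, succ, k=1)  # pass@1 update direction
u8 = aggregate_update(grads, succ, k=8)  # pass@8 update direction

print(np.dot(u8, u1))  # negative: the pass@8 update opposes the pass@1 update
```

For $k=8$ the hard prompt's weight $8 \cdot 0.9^7 \approx 3.83$ dominates the easy prompt's $8 \cdot 0.1^7 \approx 8{\times}10^{-7}$, so the aggregate rotates toward the interfering gradient; with $k=1$ both prompts are weighted equally and the update still improves pass@1.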

Metadata

arXiv ID: 2602.21189
Provider: ARXIV
Primary Category: cs.LG
Published: 2026-02-24
Fetched: 2026-02-25 06:05
