
Test-Time Adaptation via Many-Shot Prompting: Benefits, Limits, and Pitfalls

Authors

Shubhangi Upasani, Chen Wu, Jay Rainton, Bo Li, Changran Hu, Qizheng Zhang, Urmish Thakker

Abstract

Test-time adaptation enables large language models (LLMs) to modify their behavior at inference without updating model parameters. A common approach is many-shot prompting, where large numbers of in-context learning (ICL) examples are injected as an input-space test-time update. Although performance can improve as more demonstrations are added, the reliability and limits of this update mechanism remain poorly understood, particularly for open-source models. We present an empirical study of many-shot prompting across tasks and model backbones, analyzing how performance varies with update magnitude, example ordering, and selection policy. We further study Dynamic and Reinforced ICL as alternative test-time update strategies that control which information is injected and how it constrains model behavior. We find that many-shot prompting is effective for structured tasks where demonstrations provide high information gain, but is highly sensitive to selection strategy and often shows limited benefits for open-ended generation tasks. Overall, we characterize the practical limits of prompt-based test-time adaptation and outline when input-space updates are beneficial versus harmful.
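The input-space update the abstract describes — prepending many demonstrations to the query — can be illustrated with a minimal sketch. The prompt template, the `selector` policy, and all function names below are hypothetical illustrations, not the paper's actual implementation; a selector that retrieves examples per query loosely corresponds to what the abstract calls Dynamic ICL.

```python
# Hypothetical sketch of many-shot prompting as an input-space
# test-time update. The real prompt format and selection policies
# used in the paper are not specified here.
def build_many_shot_prompt(demos, query, k, selector=None):
    """Inject up to k in-context demonstrations ahead of the query.

    demos:    list of (input, output) pairs
    k:        update magnitude, i.e. how many shots to include
    selector: optional policy that chooses/orders demonstrations
              per query (e.g. similarity-based retrieval)
    """
    chosen = selector(demos, query, k) if selector else demos[:k]
    shots = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in chosen)
    return f"{shots}\n\nInput: {query}\nOutput:"

prompt = build_many_shot_prompt(
    [("2+2", "4"), ("3+5", "8")], query="7+6", k=2
)
```

Varying `k` and swapping in different `selector` policies is the knob the study turns: more shots change the magnitude of the update, while the selector controls which information is injected.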

Metadata

arXiv ID: 2603.05829
Provider: ARXIV
Primary Category: cs.LG
Published: 2026-03-06
Fetched: 2026-03-09 06:05
