TARo: Token-level Adaptive Routing for LLM Test-time Alignment

Authors

Arushi Rai, Qiang Zhang, Hanqing Zeng, Yunkai Zhang, Dipesh Tamboli, Xiangjun Fan, Zhuokai Zhao

Abstract

Large language models (LLMs) exhibit strong reasoning capabilities but typically require expensive post-training to reach high performance. Recent test-time alignment methods offer a lightweight alternative, but have been explored mainly for preference alignment rather than reasoning. To bridge this gap, we propose Token-level Adaptive Routing (TARo), which steers frozen LLMs toward structured reasoning entirely at inference time. Specifically, we first train reward models on step-wise mathematical traces to capture fine-grained logical consistency signals, then introduce a learnable token-level router that automatically controls how strongly the reward model guides the base model. Extensive experiments show that TARo significantly improves reasoning performance by up to +22.4% over the base model and +8.4% over existing token-level test-time alignment methods, while also boosting out-of-distribution clinical reasoning (MedXpertQA) and instruction following (AlpacaEval). Furthermore, TARo generalizes from small to large backbones without retraining, extending test-time alignment from preference optimization to robust, cross-domain reasoning.
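
To make the routing mechanism concrete, below is a minimal, hypothetical PyTorch sketch of token-level reward-guided decoding with a learned router, based only on the abstract's description. The TokenRouter module, the sigmoid gate, the top-k candidate re-ranking, and the assumption that the reward model supplies per-token scores are all illustrative choices, not the paper's actual architecture.

# Hypothetical sketch of token-level reward-guided decoding with a learned
# router, loosely following the abstract's description of TARo. Module
# shapes, the gating formula, and the candidate-rescoring scheme are
# assumptions for illustration; the paper's architecture may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenRouter(nn.Module):
    """Maps the decoder hidden state to a gate in [0, 1] that scales
    how strongly the reward model steers the next-token choice."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Linear(hidden_dim // 2, 1),
        )

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.mlp(hidden)).squeeze(-1)  # (batch,)

@torch.no_grad()
def routed_decode_step(base_logits, hidden, reward_scores, router, top_k=20):
    """One decoding step: re-rank the frozen base model's top-k candidates
    by a router-gated mix of base logits and token-level reward scores.

    base_logits:   (batch, vocab) next-token logits from the frozen LLM
    hidden:        (batch, hidden_dim) last-layer hidden state
    reward_scores: (batch, vocab) assumed per-token reward-model scores
    """
    beta = router(hidden).unsqueeze(-1)                 # (batch, 1) gate
    topk_logits, topk_ids = base_logits.topk(top_k, dim=-1)
    topk_rewards = reward_scores.gather(-1, topk_ids)
    mixed = topk_logits + beta * topk_rewards           # gated guidance
    choice = F.softmax(mixed, dim=-1).multinomial(1)    # sample a candidate
    return topk_ids.gather(-1, choice)                  # (batch, 1) token ids

# Toy usage with random tensors standing in for real model outputs.
batch, vocab, hidden_dim = 2, 1000, 64
router = TokenRouter(hidden_dim)
next_tok = routed_decode_step(
    torch.randn(batch, vocab), torch.randn(batch, hidden_dim),
    torch.randn(batch, vocab), router)
print(next_tok.shape)  # torch.Size([2, 1])

Under this reading, the router learns when to trust the reward model: a gate near 0 leaves the frozen base model's distribution essentially untouched, while a gate near 1 lets the step-wise reward scores re-rank the candidate tokens.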

Metadata

arXiv ID: 2603.18411
Provider: ARXIV
Primary Category: cs.CL
Categories: cs.CL, cs.AI, cs.LG
Links: https://arxiv.org/abs/2603.18411v1 (abstract), https://arxiv.org/pdf/2603.18411v1 (PDF)
Published: 2026-03-19
Fetched: 2026-03-20 06:02
