February 26, 2026

ContextRL: Enhancing MLLM's Knowledge Discovery Efficiency with Context-Augmented RL

Authors

Xingyu Lu, Jinpeng Wang, YiFan Zhang, Shijie Ma, Xiao Hu, Tianke Zhang, Haonan Fan, Kaiyu Jiang, Changyi Liu, Kaiyu Tang, Bin Wen, Fan Yang, Tingting Gao, Han Li, Chun Yuan

Abstract

We propose ContextRL, a novel framework that leverages context augmentation to overcome these bottlenecks. Specifically, to enhance Identifiability, we provide the reward model with full reference solutions as context, enabling fine-grained process verification to filter out false positives (samples with the right answer but a low-quality reasoning process). To improve Reachability, we introduce a multi-turn sampling strategy in which the reward model generates mistake reports for failed attempts, guiding the policy to "recover" correct responses from previously all-negative groups. Experimental results on 11 perception and reasoning benchmarks show that ContextRL significantly improves knowledge discovery efficiency. Notably, ContextRL enables the Qwen3-VL-8B model to achieve performance comparable to the 32B model, outperforming standard RLVR baselines by a large margin while effectively mitigating reward hacking. Our in-depth analysis reveals the significant potential of contextual information for improving reward model accuracy and documents the widespread occurrence of reward hacking, offering valuable insights for future RLVR research.
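The two mechanisms in the abstract can be sketched in toy form: (1) Identifiability, where a verifier given the full reference solution rejects "false positives" whose final answer is correct but whose reasoning is poor; and (2) Reachability, where an all-negative sample group triggers a mistake report that is fed back to the policy before resampling. This is a minimal illustration, not the paper's implementation; all names (`verify`, `mistake_report`, `context_rl_rollout`) and the stand-in `reasoning_ok` flag are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Response:
    answer: str
    reasoning_ok: bool  # hypothetical stand-in for the verifier's process judgment

def verify(resp, reference):
    # Identifiability sketch: with the reference solution in context, the
    # verifier checks the reasoning process too, rejecting false positives
    # (right answer, low-quality reasoning), not just matching final answers.
    return resp.answer == reference.answer and resp.reasoning_ok

def mistake_report(resp, reference):
    # Hypothetical: the reward model summarizes what went wrong in a failed attempt.
    return f"expected {reference.answer}, got {resp.answer}"

def context_rl_rollout(policy, reference, n=4, max_turns=2):
    # Reachability sketch: if a whole group is negative, feed a mistake
    # report back to the policy and resample, trying to "recover" positives
    # from a previously all-negative group.
    feedback = None
    for _ in range(max_turns):
        group = [policy(feedback) for _ in range(n)]
        positives = [r for r in group if verify(r, reference)]
        if positives:
            return positives
        feedback = mistake_report(group[0], reference)
    return []

# Toy demo: a false positive is filtered, and a policy that only succeeds
# once it sees a mistake report still yields a positive group.
reference = Response("42", True)
assert not verify(Response("42", False), reference)  # right answer, bad reasoning

def toy_policy(feedback):
    return Response("42", True) if feedback else Response("17", True)

recovered = context_rl_rollout(toy_policy, reference)
print(len(recovered))  # 4
```

Under these assumptions, the first sampling turn fails entirely, the mistake report steers the second turn, and the group is recovered with all four responses passing verification.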

Metadata

arXiv ID: 2602.22623
Provider: ARXIV
Primary Category: cs.LG
Published: 2026-02-26
Fetched: 2026-02-27 04:35
Categories: cs.LG, cs.AI, cs.CL
Comment: 14 pages, 5 figures
