Paper
Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections
Authors
Łukasz Borchmann, Jordy Van Landeghem, Michał Turski, Shreyansh Padarha, Ryan Othniel Kearns, Adam Mahdi, Niels Rogge, Clémentine Fourrier, Siwei Han, Huaxiu Yao, Artemis Llabrés, Yiming Xu, Dimosthenis Karatzas, Hao Zhang, Anupam Datta
Abstract
Multimodal agents offer a promising path to automating complex document-intensive workflows. Yet, a critical question remains: do these agents demonstrate genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design it to maximize discriminative power across varying levels of agentic ability. To evaluate agentic behaviour, we introduce a novel evaluation protocol measuring the accuracy-effort trade-off. Using this framework, we show that while the best agents can match human searchers in raw accuracy, they succeed on largely different questions and rely on brute-force search to compensate for weak strategic planning. They fail to close the nearly 20% gap to oracle performance, persisting in unproductive loops. We release the dataset and evaluation harness to facilitate the transition from brute-force retrieval to calibrated, efficient reasoning.
Metadata
arXiv: 2603.12180v1 • Categories: cs.CL, cs.AI • Published: 2026-03-12
Related papers
Fractal universe and quantum gravity made simple
Fabio Briscese, Gianluca Calcagni • 2026-03-25
POLY-SIM: Polyglot Speaker Identification with Missing Modality Grand Challenge 2026 Evaluation Plan
Marta Moscati, Muhammad Saad Saeed, Marina Zanoni, Mubashir Noman, Rohan Kuma... • 2026-03-25
LensWalk: Agentic Video Understanding by Planning How You See in Videos
Keliang Li, Yansong Li, Hongze Shen, Mengdi Liu, Hong Chang, Shiguang Shan • 2026-03-25
Orientation Reconstruction of Proteins using Coulomb Explosions
Tomas André, Alfredo Bellisario, Nicusor Timneanu, Carl Caleman • 2026-03-25
The role of spatial context and multitask learning in the detection of organic and conventional farming systems based on Sentinel-2 time series
Jan Hemmerling, Marcel Schwieder, Philippe Rufin, Leon-Friedrich Thomas, Mire... • 2026-03-25