
Agentic retrieval-augmented reasoning reshapes collective reliability under model variability in radiology question answering

Authors

Mina Farajiamiri, Jeta Sopa, Saba Afza, Lisa Adams, Felix Barajas Ordonez, Tri-Thien Nguyen, Mahshad Lotfinia, Sebastian Wind, Keno Bressem, Sven Nebelung, Daniel Truhn, Soroosh Tayebi Arasteh

Abstract

Agentic retrieval-augmented reasoning pipelines are increasingly used to structure how large language models (LLMs) incorporate external evidence in clinical decision support. These systems iteratively retrieve curated domain knowledge and synthesize it into structured reports before answer selection. Although such pipelines can improve performance, their impact on reliability under model variability remains unclear. In real-world deployment, heterogeneous models may align, diverge, or synchronize errors in ways not captured by accuracy. We evaluated 34 LLMs on 169 expert-curated, publicly available radiology questions, comparing zero-shot inference with a radiology-specific multi-step agentic retrieval condition in which all models received identical structured evidence reports derived from curated radiology knowledge. Agentic inference reduced inter-model decision dispersion (median entropy 0.48 vs. 0.13) and increased robustness of correctness across models (mean 0.74 vs. 0.81). Majority consensus also increased overall (P<0.001). Consensus strength and robust correctness remained correlated under both strategies (ρ=0.88 for zero-shot; ρ=0.87 for agentic), although high agreement did not guarantee correctness. Response verbosity showed no meaningful association with correctness. Among 572 incorrect outputs, 72% were associated with moderate or high clinically assessed severity, although inter-rater agreement was low (κ=0.02). Agentic retrieval was therefore associated with more concentrated decision distributions, stronger consensus, and higher cross-model robustness of correctness. These findings suggest that evaluating agentic systems through accuracy or agreement alone may not be sufficient, and that complementary analyses of stability, cross-model robustness, and potential clinical impact are needed to characterize reliability under model variability.
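To make the agreement metrics above concrete, the following is a minimal Python sketch (not the authors' code) of how per-question decision entropy, majority-consensus strength, cross-model robustness of correctness, and their Spearman correlation could be computed. The function names, the normalization of entropy by log2 of the number of answer options, and the toy data are all illustrative assumptions; the paper's exact definitions may differ.

"""
Illustrative sketch of the reliability metrics named in the abstract.
All names and the toy data below are assumptions, not the authors' code.
"""
from collections import Counter
import math

from scipy.stats import spearmanr  # standard Spearman rank correlation


def decision_entropy(answers, n_options):
    """Shannon entropy (bits) of the answer distribution across models,
    normalized by log2(n_options): 0 = full agreement, 1 = uniform spread."""
    n = len(answers)
    probs = [c / n for c in Counter(answers).values()]
    h = -sum(p * math.log2(p) for p in probs)
    return h / math.log2(n_options) if n_options > 1 else 0.0


def consensus_strength(answers):
    """Fraction of models choosing the modal (majority) answer."""
    return Counter(answers).most_common(1)[0][1] / len(answers)


def robust_correctness(answers, correct):
    """Fraction of models answering correctly; averaging this over
    questions yields a cross-model robustness summary in the style
    of the reported means (0.74 vs. 0.81)."""
    return sum(a == correct for a in answers) / len(answers)


if __name__ == "__main__":
    # Toy data: 5 models answering 3 four-option questions (A-D).
    answers_per_q = [list("AAABA"), list("ABCDA"), list("CCCCC")]
    keys = list("ABC")  # assumed answer key, one entry per question

    entropies = [decision_entropy(a, n_options=4) for a in answers_per_q]
    consensus = [consensus_strength(a) for a in answers_per_q]
    robustness = [robust_correctness(a, k)
                  for a, k in zip(answers_per_q, keys)]

    # Per-question correlation of consensus with robust correctness,
    # analogous to the reported rho=0.88 / rho=0.87.
    rho, p = spearmanr(consensus, robustness)
    print("entropy:", entropies)
    print("consensus:", consensus, "robustness:", robustness)
    print(f"Spearman rho={rho:.2f} (p={p:.3f})")

Under this (assumed) normalization, a drop in median entropy from 0.48 to 0.13 corresponds to model answers concentrating on fewer options per question, which matches the direction of the abstract's dispersion result; note that high consensus on a wrong option would still yield low entropy, which is why the authors stress that agreement alone does not guarantee correctness.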

Metadata

arXiv ID: 2603.06271
Provider: ARXIV
Primary Category: cs.LG
Categories: cs.LG, cs.AI
Published: 2026-03-06
Fetched: 2026-03-09 06:05
