Research

Paper

AI LLM March 25, 2026

Dialogue to Question Generation for Evidence-based Medical Guideline Agent Development

Authors

Zongliang Ji, Ziyang Zhang, Xincheng Tan, Matthew Thompson, Anna Goldenberg, Carl Yang, Rahul G. Krishnan, Fan Zhang

Abstract

Evidence-based medicine (EBM) is central to high-quality care, but remains difficult to implement in fast-paced primary care settings. Physicians face short consultations, increasing patient loads, and lengthy guideline documents that are impractical to consult in real time. To address this gap, we investigate the feasibility of using large language models (LLMs) as ambient assistants that surface targeted, evidence-based questions during physician-patient encounters. Our study focuses on question generation rather than question answering, with the aim of scaffolding physician reasoning and integrating guideline-based practice into brief consultations. We implemented two prompting strategies, a zero-shot baseline and a multi-stage reasoning variant, using Gemini 2.5 as the backbone model. We evaluated on a benchmark of 80 de-identified transcripts from real clinical encounters, with six experienced physicians contributing over 90 hours of structured review. Results indicate that while general-purpose LLMs are not yet fully reliable, they can produce clinically meaningful and guideline-relevant questions, suggesting significant potential to reduce cognitive burden and make EBM more actionable at the point of care.

Metadata

arXiv ID: 2603.23937

Provider: ARXIV

Primary Category: cs.CL

Published: 2026-03-25

Fetched: 2026-03-26 06:02

Related papers

Vibe Coding XR: Accelerating AI + XR Prototyping with XR Blocks and Gemini

Ruofei Du, Benjamin Hersh, David Li, Nels Numan, Xun Qian, Yanhe Chen, Zhongy... • 2026-03-25

Comparing Developer and LLM Biases in Code Evaluation

Aditya Mittal, Ryan Shar, Zichu Wu, Shyam Agarwal, Tongshuang Wu, Chris Donah... • 2026-03-25

The Stochastic Gap: A Markovian Framework for Pre-Deployment Reliability and Oversight-Cost Auditing in Agentic Artificial Intelligence

Biplab Pal, Santanu Bhattacharya • 2026-03-25

Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA

Saahil Mathur, Ryan David Rittner, Vedant Ajit Thakur, Daniel Stuart Schiff, ... • 2026-03-25

MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination

Zhuo Li, Yupeng Zhang, Pengyu Cheng, Jiajun Song, Mengyu Zhou, Hao Li, Shujie... • 2026-03-25

Raw Data (Debug)

{
  "raw_xml": "<entry>\n    <id>http://arxiv.org/abs/2603.23937v1</id>\n    <title>Dialogue to Question Generation for Evidence-based Medical Guideline Agent Development</title>\n    <updated>2026-03-25T04:58:53Z</updated>\n    <link href='https://arxiv.org/abs/2603.23937v1' rel='alternate' type='text/html'/>\n    <link href='https://arxiv.org/pdf/2603.23937v1' rel='related' title='pdf' type='application/pdf'/>\n    <summary>Evidence-based medicine (EBM) is central to high-quality care, but remains difficult to implement in fast-paced primary care settings. Physicians face short consultations, increasing patient loads, and lengthy guideline documents that are impractical to consult in real time. To address this gap, we investigate the feasibility of using large language models (LLMs) as ambient assistants that surface targeted, evidence-based questions during physician-patient encounters. Our study focuses on question generation rather than question answering, with the aim of scaffolding physician reasoning and integrating guideline-based practice into brief consultations. We implemented two prompting strategies, a zero-shot baseline and a multi-stage reasoning variant, using Gemini 2.5 as the backbone model. We evaluated on a benchmark of 80 de-identified transcripts from real clinical encounters, with six experienced physicians contributing over 90 hours of structured review. Results indicate that while general-purpose LLMs are not yet fully reliable, they can produce clinically meaningful and guideline-relevant questions, suggesting significant potential to reduce cognitive burden and make EBM more actionable at the point of care.</summary>\n    <category scheme='http://arxiv.org/schemas/atom' term='cs.CL'/>\n    <category scheme='http://arxiv.org/schemas/atom' term='cs.LG'/>\n    <published>2026-03-25T04:58:53Z</published>\n    <arxiv:comment>9 pages. To appear in Proceedings of Machine Learning Research (PMLR), Machine Learning for Health (ML4H) Symposium 2025</arxiv:comment>\n    <arxiv:primary_category term='cs.CL'/>\n    <arxiv:journal_ref>Proceedings of Machine Learning Research 2025</arxiv:journal_ref>\n    <author>\n      <name>Zongliang Ji</name>\n    </author>\n    <author>\n      <name>Ziyang Zhang</name>\n    </author>\n    <author>\n      <name>Xincheng Tan</name>\n    </author>\n    <author>\n      <name>Matthew Thompson</name>\n    </author>\n    <author>\n      <name>Anna Goldenberg</name>\n    </author>\n    <author>\n      <name>Carl Yang</name>\n    </author>\n    <author>\n      <name>Rahul G. Krishnan</name>\n    </author>\n    <author>\n      <name>Fan Zhang</name>\n    </author>\n  </entry>"
}