Research

Paper

AI LLM March 02, 2026

ClinConsensus: A Consensus-Based Benchmark for Evaluating Chinese Medical LLMs across Difficulty Levels

Authors

Xiang Zheng, Han Li, Wenjie Luo, Weiqi Zhai, Yiyuan Li, Chuanmiao Yan, Tianyi Tang, Yubo Ma, Kexin Yang, Dayiheng Liu, Hu Wei, Bing Zhao

Abstract

Large language models (LLMs) are increasingly applied to health management, showing promise across disease prevention, clinical decision-making, and long-term care. However, existing medical benchmarks remain largely static and task-isolated, failing to capture the openness, longitudinal structure, and safety-critical complexity of real-world clinical workflows. We introduce ClinConsensus, a Chinese medical benchmark curated, validated, and quality-controlled by clinical experts. ClinConsensus comprises 2500 open-ended cases spanning the full continuum of care--from prevention and intervention to long-term follow-up--covering 36 medical specialties, 12 common clinical task types, and progressively increasing levels of complexity. To enable reliable evaluation of such complex scenarios, we adopt a rubric-based grading protocol and propose the Clinically Applicable Consistency Score (CACS@k). We further introduce a dual-judge evaluation framework, combining a high-capability LLM-as-judge with a distilled, locally deployable judge model trained via supervised fine-tuning, enabling scalable and reproducible evaluation aligned with physician judgment. Using ClinConsensus, we conduct a comprehensive assessment of several leading LLMs and reveal substantial heterogeneity across task themes, care stages, and medical specialties. While top-performing models achieve comparable overall scores, they differ markedly in reasoning, evidence use, and longitudinal follow-up capabilities, and clinically actionable treatment planning remains a key bottleneck. We release ClinConsensus as an extensible benchmark to support the development and evaluation of medical LLMs that are robust, clinically grounded, and ready for real-world deployment.

Metadata

arXiv ID: 2603.02097

Provider: ARXIV

Primary Category: cs.CL

Published: 2026-03-02

Fetched: 2026-03-03 04:34

Related papers

Vibe Coding XR: Accelerating AI + XR Prototyping with XR Blocks and Gemini

Ruofei Du, Benjamin Hersh, David Li, Nels Numan, Xun Qian, Yanhe Chen, Zhongy... • 2026-03-25

Comparing Developer and LLM Biases in Code Evaluation

Aditya Mittal, Ryan Shar, Zichu Wu, Shyam Agarwal, Tongshuang Wu, Chris Donah... • 2026-03-25

The Stochastic Gap: A Markovian Framework for Pre-Deployment Reliability and Oversight-Cost Auditing in Agentic Artificial Intelligence

Biplab Pal, Santanu Bhattacharya • 2026-03-25

Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA

Saahil Mathur, Ryan David Rittner, Vedant Ajit Thakur, Daniel Stuart Schiff, ... • 2026-03-25

MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination

Zhuo Li, Yupeng Zhang, Pengyu Cheng, Jiajun Song, Mengyu Zhou, Hao Li, Shujie... • 2026-03-25

Raw Data (Debug)

{
  "raw_xml": "<entry>\n    <id>http://arxiv.org/abs/2603.02097v1</id>\n    <title>ClinConsensus: A Consensus-Based Benchmark for Evaluating Chinese Medical LLMs across Difficulty Levels</title>\n    <updated>2026-03-02T17:17:18Z</updated>\n    <link href='https://arxiv.org/abs/2603.02097v1' rel='alternate' type='text/html'/>\n    <link href='https://arxiv.org/pdf/2603.02097v1' rel='related' title='pdf' type='application/pdf'/>\n    <summary>Large language models (LLMs) are increasingly applied to health management, showing promise across disease prevention, clinical decision-making, and long-term care. However, existing medical benchmarks remain largely static and task-isolated, failing to capture the openness, longitudinal structure, and safety-critical complexity of real-world clinical workflows. We introduce ClinConsensus, a Chinese medical benchmark curated, validated, and quality-controlled by clinical experts. ClinConsensus comprises 2500 open-ended cases spanning the full continuum of care--from prevention and intervention to long-term follow-up--covering 36 medical specialties, 12 common clinical task types, and progressively increasing levels of complexity. To enable reliable evaluation of such complex scenarios, we adopt a rubric-based grading protocol and propose the Clinically Applicable Consistency Score (CACS@k). We further introduce a dual-judge evaluation framework, combining a high-capability LLM-as-judge with a distilled, locally deployable judge model trained via supervised fine-tuning, enabling scalable and reproducible evaluation aligned with physician judgment. Using ClinConsensus, we conduct a comprehensive assessment of several leading LLMs and reveal substantial heterogeneity across task themes, care stages, and medical specialties. While top-performing models achieve comparable overall scores, they differ markedly in reasoning, evidence use, and longitudinal follow-up capabilities, and clinically actionable treatment planning remains a key bottleneck. We release ClinConsensus as an extensible benchmark to support the development and evaluation of medical LLMs that are robust, clinically grounded, and ready for real-world deployment.</summary>\n    <category scheme='http://arxiv.org/schemas/atom' term='cs.CL'/>\n    <published>2026-03-02T17:17:18Z</published>\n    <arxiv:comment>8 pages, 6 figures,</arxiv:comment>\n    <arxiv:primary_category term='cs.CL'/>\n    <author>\n      <name>Xiang Zheng</name>\n    </author>\n    <author>\n      <name>Han Li</name>\n    </author>\n    <author>\n      <name>Wenjie Luo</name>\n    </author>\n    <author>\n      <name>Weiqi Zhai</name>\n    </author>\n    <author>\n      <name>Yiyuan Li</name>\n    </author>\n    <author>\n      <name>Chuanmiao Yan</name>\n    </author>\n    <author>\n      <name>Tianyi Tang</name>\n    </author>\n    <author>\n      <name>Yubo Ma</name>\n    </author>\n    <author>\n      <name>Kexin Yang</name>\n    </author>\n    <author>\n      <name>Dayiheng Liu</name>\n    </author>\n    <author>\n      <name>Hu Wei</name>\n    </author>\n    <author>\n      <name>Bing Zhao</name>\n    </author>\n  </entry>"
}