Paper
Who Judges the Judge? Evaluating LLM-as-a-Judge for French Medical open-ended QA
Authors
Ikram Belmadani, Oumaima El Khettari, Pacôme Constant dit Beaufils, Richard Dufour, Benoit Favre
Abstract
Automatic evaluation of medical open-ended question answering (OEQA) remains challenging due to the need for expert annotations. We evaluate whether large language models (LLMs) can act as judges of semantic equivalence in French medical OEQA, comparing closed-access, general-purpose, and biomedical domain-adapted models. Our results show that LLM-based judgments are strongly influenced by the model that generated the answer, with agreement varying substantially across generators. Domain-adapted and large general-purpose models achieve the highest alignment with expert annotations. We further show that lightweight adaptation of a compact model using supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) substantially improves performance and reduces generator sensitivity, even with limited data. Overall, our findings highlight the need for generator-aware evaluation and suggest that carefully adapted small models can support scalable evaluation in low-resource medical settings.
Metadata
Related papers
Gen-Searcher: Reinforcing Agentic Search for Image Generation
Kaituo Feng, Manyuan Zhang, Shuang Chen, Yunlong Lin, Kaixuan Fan, Yilei Jian... • 2026-03-30
On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers
Omer Dahary, Benaya Koren, Daniel Garibi, Daniel Cohen-Or • 2026-03-30
Graphilosophy: Graph-Based Digital Humanities Computing with The Four Books
Minh-Thu Do, Quynh-Chau Le-Tran, Duc-Duy Nguyen-Mai, Thien-Trang Nguyen, Khan... • 2026-03-30
ParaSpeechCLAP: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining
Anuj Diwan, Eunsol Choi, David Harwath • 2026-03-30
RAD-AI: Rethinking Architecture Documentation for AI-Augmented Ecosystems
Oliver Aleksander Larsen, Mahyar T. Moghaddam • 2026-03-30
Raw Data (Debug)
{
"raw_xml": "<entry>\n <id>http://arxiv.org/abs/2603.04033v1</id>\n <title>Who Judges the Judge? Evaluating LLM-as-a-Judge for French Medical open-ended QA</title>\n <updated>2026-03-04T13:12:30Z</updated>\n <link href='https://arxiv.org/abs/2603.04033v1' rel='alternate' type='text/html'/>\n <link href='https://arxiv.org/pdf/2603.04033v1' rel='related' title='pdf' type='application/pdf'/>\n <summary>Automatic evaluation of medical open-ended question answering (OEQA) remains challenging due to the need for expert annotations. We evaluate whether large language models (LLMs) can act as judges of semantic equivalence in French medical OEQA, comparing closed-access, general-purpose, and biomedical domain-adapted models. Our results show that LLM-based judgments are strongly influenced by the model that generated the answer, with agreement varying substantially across generators. Domain-adapted and large general-purpose models achieve the highest alignment with expert annotations. We further show that lightweight adaptation of a compact model using supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) substantially improves performance and reduces generator sensitivity, even with limited data. Overall, our findings highlight the need for generator-aware evaluation and suggest that carefully adapted small models can support scalable evaluation in low-resource medical settings.</summary>\n <category scheme='http://arxiv.org/schemas/atom' term='cs.CL'/>\n <published>2026-03-04T13:12:30Z</published>\n <arxiv:comment>Accepted in HeaLing Workshop - EACL 2026</arxiv:comment>\n <arxiv:primary_category term='cs.CL'/>\n <author>\n <name>Ikram Belmadani</name>\n </author>\n <author>\n <name>Oumaima El Khettari</name>\n </author>\n <author>\n <name>Pacôme Constant dit Beaufils</name>\n </author>\n <author>\n <name>Richard Dufour</name>\n </author>\n <author>\n <name>Benoit Favre</name>\n </author>\n </entry>"
}