Paper
PSF-Med: Measuring and Explaining Paraphrase Sensitivity in Medical Vision Language Models
Authors
Binesh Sadanandan, Vahid Behzadan
Abstract
Medical Vision Language Models (VLMs) can change their answers when clinicians rephrase the same question, which raises deployment risks. We introduce Paraphrase Sensitivity Failure (PSF)-Med, a benchmark of 19,748 chest Xray questions paired with about 92,000 meaningpreserving paraphrases across MIMIC-CXR and PadChest. Across six medical VLMs, we measure yes/no flips for the same image and find flip rates from 8% to 58%. However, low flip rate does not imply visual grounding: text-only baselines show that some models stay consistent even when the image is removed, suggesting they rely on language priors. To study mechanisms in one model, we apply GemmaScope 2 Sparse Autoencoders (SAEs) to MedGemma 4B and analyze FlipBank, a curated set of 158 flip cases. We identify a sparse feature at layer 17 that correlates with prompt framing and predicts decision margin shifts. In causal patching, removing this feature's contribution recovers 45% of the yesminus-no logit margin on average and fully reverses 15% of flips. Acting on this finding, we show that clamping the identified feature at inference reduces flip rates by 31% relative with only a 1.3 percentage-point accuracy cost, while also decreasing text-prior reliance. These results suggest that flip rate alone is not enough; robustness evaluations should test both paraphrase stability and image reliance.
Metadata
Related papers
Fractal universe and quantum gravity made simple
Fabio Briscese, Gianluca Calcagni • 2026-03-25
POLY-SIM: Polyglot Speaker Identification with Missing Modality Grand Challenge 2026 Evaluation Plan
Marta Moscati, Muhammad Saad Saeed, Marina Zanoni, Mubashir Noman, Rohan Kuma... • 2026-03-25
LensWalk: Agentic Video Understanding by Planning How You See in Videos
Keliang Li, Yansong Li, Hongze Shen, Mengdi Liu, Hong Chang, Shiguang Shan • 2026-03-25
Orientation Reconstruction of Proteins using Coulomb Explosions
Tomas André, Alfredo Bellisario, Nicusor Timneanu, Carl Caleman • 2026-03-25
The role of spatial context and multitask learning in the detection of organic and conventional farming systems based on Sentinel-2 time series
Jan Hemmerling, Marcel Schwieder, Philippe Rufin, Leon-Friedrich Thomas, Mire... • 2026-03-25
Raw Data (Debug)
{
"raw_xml": "<entry>\n <id>http://arxiv.org/abs/2602.21428v1</id>\n <title>PSF-Med: Measuring and Explaining Paraphrase Sensitivity in Medical Vision Language Models</title>\n <updated>2026-02-24T23:03:50Z</updated>\n <link href='https://arxiv.org/abs/2602.21428v1' rel='alternate' type='text/html'/>\n <link href='https://arxiv.org/pdf/2602.21428v1' rel='related' title='pdf' type='application/pdf'/>\n <summary>Medical Vision Language Models (VLMs) can change their answers when clinicians rephrase the same question, which raises deployment risks. We introduce Paraphrase Sensitivity Failure (PSF)-Med, a benchmark of 19,748 chest Xray questions paired with about 92,000 meaningpreserving paraphrases across MIMIC-CXR and PadChest. Across six medical VLMs, we measure yes/no flips for the same image and find flip rates from 8% to 58%. However, low flip rate does not imply visual grounding: text-only baselines show that some models stay consistent even when the image is removed, suggesting they rely on language priors. To study mechanisms in one model, we apply GemmaScope 2 Sparse Autoencoders (SAEs) to MedGemma 4B and analyze FlipBank, a curated set of 158 flip cases. We identify a sparse feature at layer 17 that correlates with prompt framing and predicts decision margin shifts. In causal patching, removing this feature's contribution recovers 45% of the yesminus-no logit margin on average and fully reverses 15% of flips. Acting on this finding, we show that clamping the identified feature at inference reduces flip rates by 31% relative with only a 1.3 percentage-point accuracy cost, while also decreasing text-prior reliance. These results suggest that flip rate alone is not enough; robustness evaluations should test both paraphrase stability and image reliance.</summary>\n <category scheme='http://arxiv.org/schemas/atom' term='cs.CV'/>\n <category scheme='http://arxiv.org/schemas/atom' term='cs.LG'/>\n <published>2026-02-24T23:03:50Z</published>\n <arxiv:primary_category term='cs.CV'/>\n <author>\n <name>Binesh Sadanandan</name>\n </author>\n <author>\n <name>Vahid Behzadan</name>\n </author>\n </entry>"
}