AI LLM March 23, 2026

MARCUS: An agentic, multimodal vision-language model for cardiac diagnosis and management

Authors

Jack W O'Sullivan, Mohammad Asadi, Lennart Elbe, Akshay Chaudhari, Tahoura Nedaee, Francois Haddad, Michael Salerno, Li Fe-Fei, Ehsan Adeli, Rima Arnaout, Euan A Ashley

Abstract

Cardiovascular disease remains the leading cause of global mortality, with progress hindered by human interpretation of complex cardiac tests. Current AI vision-language models are limited to single-modality inputs and are non-interactive. We present MARCUS (Multimodal Autonomous Reasoning and Chat for Ultrasound and Signals), an agentic vision-language system for end-to-end interpretation of electrocardiograms (ECGs), echocardiograms, and cardiac magnetic resonance imaging (CMR) independently and as multimodal input. MARCUS employs a hierarchical agentic architecture comprising modality-specific vision-language expert models, each integrating domain-trained visual encoders with multi-stage language model optimization, coordinated by a multimodal orchestrator. Trained on 13.5 million images (0.25M ECGs, 1.3M echocardiogram images, 12M CMR images) and our novel expert-curated dataset spanning 1.6 million questions, MARCUS achieves state-of-the-art performance surpassing frontier models (GPT-5 Thinking, Gemini 2.5 Pro Deep Think). Across internal (Stanford) and external (UCSF) test cohorts, MARCUS achieves accuracies of 87-91% for ECG, 67-86% for echocardiography, and 85-88% for CMR, outperforming frontier models by 34-45% (P<0.001). On multimodal cases, MARCUS achieved 70% accuracy, nearly triple that of frontier models (22-28%), with 1.7-3.0x higher free-text quality scores. Our agentic architecture also confers resistance to mirage reasoning, whereby vision-language models derive reasoning from unintended textual signals or hallucinated visual content. MARCUS demonstrates that domain-specific visual encoders with an agentic orchestrator enable multimodal cardiac interpretation. We release our models, code, and benchmark open-source.
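The abstract describes a hierarchical agentic design: modality-specific vision-language experts (ECG, echocardiography, CMR) coordinated by a multimodal orchestrator that routes each input and aggregates findings. A minimal sketch of that routing pattern is below; all class and method names are illustrative assumptions, not identifiers from the MARCUS codebase.

```python
# Hypothetical sketch of the orchestrator-plus-experts pattern from the
# abstract. Each expert handles one modality; the orchestrator dispatches
# each study to the matching expert and merges the findings for
# multimodal cases. Real experts would wrap domain-trained vision-language
# models; here they return placeholder strings.

class ECGExpert:
    modality = "ecg"
    def interpret(self, study):
        return f"ECG finding for {study['id']}"

class EchoExpert:
    modality = "echo"
    def interpret(self, study):
        return f"Echo finding for {study['id']}"

class CMRExpert:
    modality = "cmr"
    def interpret(self, study):
        return f"CMR finding for {study['id']}"

class Orchestrator:
    def __init__(self, experts):
        # Index experts by the modality they handle.
        self.experts = {e.modality: e for e in experts}

    def interpret(self, studies):
        # Route each study to its modality expert, then aggregate
        # the per-modality findings into one multimodal report.
        findings = [self.experts[s["modality"]].interpret(s) for s in studies]
        return " | ".join(findings)

orchestrator = Orchestrator([ECGExpert(), EchoExpert(), CMRExpert()])
report = orchestrator.interpret([
    {"id": "A1", "modality": "ecg"},
    {"id": "A1", "modality": "cmr"},
])
print(report)
```

The paper's orchestrator presumably does far more (conversation, tool use, reconciling conflicting expert outputs); this sketch only captures the dispatch-and-aggregate skeleton implied by the abstract.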

Metadata

arXiv ID: 2603.22179
Provider: ARXIV
Primary Category: cs.AI
Published: 2026-03-23
Fetched: 2026-03-24 06:02
