Research

Paper

AI LLM February 26, 2026

A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations

Authors

Soumya Dutta, Smruthi Balaji, Sriram Ganapathy

Abstract

Emotion Recognition in Conversations (ERC) presents unique challenges, requiring models to capture the temporal flow of multi-turn dialogues and to effectively integrate cues from multiple modalities. We propose Mixture of Speech-Text Experts for Recognition of Emotions (MiSTER-E), a modular Mixture-of-Experts (MoE) framework designed to decouple two core challenges in ERC: modality-specific context modeling and multimodal information fusion. MiSTER-E leverages large language models (LLMs) fine-tuned for both speech and text to provide rich utterance-level embeddings, which are then enhanced through a convolutional-recurrent context modeling layer. The system integrates predictions from three experts-speech-only, text-only, and cross-modal-using a learned gating mechanism that dynamically weighs their outputs. To further encourage consistency and alignment across modalities, we introduce a supervised contrastive loss between paired speech-text representations and a KL-divergence-based regulariza-tion across expert predictions. Importantly, MiSTER-E does not rely on speaker identity at any stage. Experiments on three benchmark datasets-IEMOCAP, MELD, and MOSI-show that our proposal achieves 70.9%, 69.5%, and 87.9% weighted F1-scores respectively, outperforming several baseline speech-text ERC systems. We also provide various ablations to highlight the contributions made in the proposed approach.

Metadata

arXiv ID: 2602.23300
Provider: ARXIV
Primary Category: cs.CL
Published: 2026-02-26
Fetched: 2026-02-27 04:35

Related papers

Raw Data (Debug)
{
  "raw_xml": "<entry>\n    <id>http://arxiv.org/abs/2602.23300v1</id>\n    <title>A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations</title>\n    <updated>2026-02-26T18:08:40Z</updated>\n    <link href='https://arxiv.org/abs/2602.23300v1' rel='alternate' type='text/html'/>\n    <link href='https://arxiv.org/pdf/2602.23300v1' rel='related' title='pdf' type='application/pdf'/>\n    <summary>Emotion Recognition in Conversations (ERC) presents unique challenges, requiring models to capture the temporal flow of multi-turn dialogues and to effectively integrate cues from multiple modalities. We propose Mixture of Speech-Text Experts for Recognition of Emotions (MiSTER-E), a modular Mixture-of-Experts (MoE) framework designed to decouple two core challenges in ERC: modality-specific context modeling and multimodal information fusion. MiSTER-E leverages large language models (LLMs) fine-tuned for both speech and text to provide rich utterance-level embeddings, which are then enhanced through a convolutional-recurrent context modeling layer. The system integrates predictions from three experts-speech-only, text-only, and cross-modal-using a learned gating mechanism that dynamically weighs their outputs. To further encourage consistency and alignment across modalities, we introduce a supervised contrastive loss between paired speech-text representations and a KL-divergence-based regulariza-tion across expert predictions. Importantly, MiSTER-E does not rely on speaker identity at any stage. Experiments on three benchmark datasets-IEMOCAP, MELD, and MOSI-show that our proposal achieves 70.9%, 69.5%, and 87.9% weighted F1-scores respectively, outperforming several baseline speech-text ERC systems. We also provide various ablations to highlight the contributions made in the proposed approach.</summary>\n    <category scheme='http://arxiv.org/schemas/atom' term='cs.CL'/>\n    <category scheme='http://arxiv.org/schemas/atom' term='eess.AS'/>\n    <published>2026-02-26T18:08:40Z</published>\n    <arxiv:comment>Accepted to Elsevier Computer Speech and Language. 30 pages, 9 figures, 5 tables</arxiv:comment>\n    <arxiv:primary_category term='cs.CL'/>\n    <author>\n      <name>Soumya Dutta</name>\n    </author>\n    <author>\n      <name>Smruthi Balaji</name>\n    </author>\n    <author>\n      <name>Sriram Ganapathy</name>\n    </author>\n  </entry>"
}