March 16, 2026

How Attention Shapes Emotion: A Comparative Study of Attention Mechanisms for Speech Emotion Recognition

Authors

Marc Casals-Salvador, Federico Costa, Rodolfo Zevallos, Javier Hernando

Abstract

Speech Emotion Recognition (SER) plays a key role in advancing human-computer interaction. Attention mechanisms have become the dominant approach for modeling emotional speech due to their ability to capture long-range dependencies and emphasize salient information. However, standard self-attention suffers from quadratic computational and memory complexity, limiting its scalability. In this work, we present a systematic benchmark of optimized attention mechanisms for SER, including RetNet, LightNet, GSA, FoX, and KDA. Experiments on both versions of the MSP-Podcast benchmark show that while standard self-attention achieves the strongest recognition performance across test sets, efficient attention variants dramatically improve scalability, reducing inference latency and memory usage by up to an order of magnitude. These results highlight a critical trade-off between accuracy and efficiency, providing practical insights for designing scalable SER systems.
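The quadratic-vs-linear trade-off the abstract describes can be made concrete with a small sketch. The following NumPy code is an illustration only, not the authors' implementation or any of the benchmarked variants (RetNet, LightNet, GSA, FoX, KDA); the feature map and shapes are assumptions chosen to show why standard attention needs an (n, n) score matrix while a kernelized linear variant does not:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def standard_self_attention(Q, K, V):
    # Materializes the full (n, n) score matrix: O(n^2) time and memory.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (n, n)
    return softmax(scores) @ V           # (n, d_v)

def linear_attention(Q, K, V):
    # Kernelized linear attention: with a positive feature map phi,
    # phi(Q) @ (phi(K).T @ V) never forms the (n, n) matrix, so the
    # extra state is only (d, d_v), independent of sequence length n.
    phi = lambda x: np.maximum(x, 0.0) + 1.0   # assumed feature map
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                        # (d, d_v)
    Z = Qp @ Kp.sum(axis=0)              # (n,) normalizer
    return (Qp @ KV) / Z[:, None]        # (n, d_v)

n, d = 512, 64
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, n, d))
print(standard_self_attention(Q, K, V).shape)  # (512, 64)
print(linear_attention(Q, K, V).shape)         # (512, 64)
```

The fixed-size (d, d_v) state in the linear variant is the same basic principle behind the recurrent and gated efficient-attention families benchmarked in the paper, and it is what enables the order-of-magnitude reductions in inference latency and memory that the abstract reports.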

Metadata

arXiv ID: 2603.15120
Provider: ARXIV
Primary Category: eess.AS
Published: 2026-03-16
Fetched: 2026-03-17 06:02
