Research

Paper

AI LLM March 04, 2026

A Sensitivity Analysis of Multi-Event Audio Grounding in Audio LLMs

Authors

Taehan Lee, Jaehan Jung, Hyukjun Lee

Abstract

Audio LLMs have shown a strong ability to understand audio samples, yet their reliability in complex acoustic scenes remains under-explored. Unlike prior work limited to small scale or less controlled query construction, we present a large-scale evaluation of event grounding and false alarms as auditory scene complexity increases. Using 71K AudioCapsV2 clips, we extract normalized (source, attribute) events and build two query types: present-event queries for ground-truth detection and absent-event queries to probe hallucinations, using similarity-filtered negative sampling in an audio-aligned text embedding space. We evaluate four SOTA Audio LLMs with 12 prompt variants over 500K yes/no queries per model. Across models, increasing event count consistently lowers true-positive rate and raises false-positive rate, while prompts induce a strong trade-off between the two. Our confidence analysis shows that models become more uncertain on multi-event audio, revealing room for improvement.

Metadata

arXiv ID: 2603.03855

Provider: ARXIV

Primary Category: cs.SD

Published: 2026-03-04

Fetched: 2026-03-05 06:06

Related papers

Gen-Searcher: Reinforcing Agentic Search for Image Generation

Kaituo Feng, Manyuan Zhang, Shuang Chen, Yunlong Lin, Kaixuan Fan, Yilei Jian... • 2026-03-30

On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers

Omer Dahary, Benaya Koren, Daniel Garibi, Daniel Cohen-Or • 2026-03-30

Graphilosophy: Graph-Based Digital Humanities Computing with The Four Books

Minh-Thu Do, Quynh-Chau Le-Tran, Duc-Duy Nguyen-Mai, Thien-Trang Nguyen, Khan... • 2026-03-30

ParaSpeechCLAP: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining

Anuj Diwan, Eunsol Choi, David Harwath • 2026-03-30

RAD-AI: Rethinking Architecture Documentation for AI-Augmented Ecosystems

Oliver Aleksander Larsen, Mahyar T. Moghaddam • 2026-03-30

Raw Data (Debug)

{
  "raw_xml": "<entry>\n    <id>http://arxiv.org/abs/2603.03855v1</id>\n    <title>A Sensitivity Analysis of Multi-Event Audio Grounding in Audio LLMs</title>\n    <updated>2026-03-04T09:04:23Z</updated>\n    <link href='https://arxiv.org/abs/2603.03855v1' rel='alternate' type='text/html'/>\n    <link href='https://arxiv.org/pdf/2603.03855v1' rel='related' title='pdf' type='application/pdf'/>\n    <summary>Audio LLMs have shown a strong ability to understand audio samples, yet their reliability in complex acoustic scenes remains under-explored. Unlike prior work limited to small scale or less controlled query construction, we present a large-scale evaluation of event grounding and false alarms as auditory scene complexity increases. Using 71K AudioCapsV2 clips, we extract normalized (source, attribute) events and build two query types: present-event queries for ground-truth detection and absent-event queries to probe hallucinations, using similarity-filtered negative sampling in an audio-aligned text embedding space. We evaluate four SOTA Audio LLMs with 12 prompt variants over 500K yes/no queries per model. Across models, increasing event count consistently lowers true-positive rate and raises false-positive rate, while prompts induce a strong trade-off between the two. Our confidence analysis shows that models become more uncertain on multi-event audio, revealing room for improvement.</summary>\n    <category scheme='http://arxiv.org/schemas/atom' term='cs.SD'/>\n    <published>2026-03-04T09:04:23Z</published>\n    <arxiv:comment>6 pages, Submitted to Interspeech 2026</arxiv:comment>\n    <arxiv:primary_category term='cs.SD'/>\n    <author>\n      <name>Taehan Lee</name>\n    </author>\n    <author>\n      <name>Jaehan Jung</name>\n    </author>\n    <author>\n      <name>Hyukjun Lee</name>\n    </author>\n  </entry>"
}