Paper
SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models
Authors
Tianyu Xie, Jinfa Huang, Yuexiao Ma, Rongfang Luo, Yan Yang, Wang Chen, Yuhui Zeng, Ruize Fang, Yixuan Zou, Xiawu Zheng, Jiebo Luo, Rongrong Ji
Abstract
Omni-modal large language models (OLMs) redefine human-machine interaction by natively integrating audio, vision, and text. However, existing OLM benchmarks remain anchored to static, accuracy-centric tasks, leaving a critical gap in assessing social interactivity: the fundamental capacity to navigate dynamic cues in natural dialogues. To this end, we propose SocialOmni, a comprehensive benchmark that operationalizes the evaluation of this conversational interactivity across three core dimensions: (i) speaker separation and identification (who is speaking), (ii) interruption timing control (when to interject), and (iii) natural interruption generation (how to phrase the interruption). SocialOmni features 2,000 perception samples and a quality-controlled diagnostic set of 209 interaction-generation instances with strict temporal and contextual constraints, complemented by controlled audio-visual inconsistency scenarios to test model robustness. We benchmark 12 leading OLMs, uncovering significant variance in their social-interaction capabilities. Furthermore, our analysis reveals a pronounced decoupling between a model's perceptual accuracy and its ability to generate contextually appropriate interruptions, indicating that understanding-centric metrics alone are insufficient to characterize conversational social competence. More encouragingly, these diagnostics from SocialOmni yield actionable signals for bridging the perception-interaction divide in future OLMs.
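For orientation, the sketch below shows one way an evaluation harness over the released dataset could be wired up, scoring the perception tasks by exact match and collecting the interruption-generation outputs for separate judging. The Hugging Face repository name comes from the paper's metadata; the split name, field names ("task", "video", "audio", "question", "answer", "id"), task labels, and the `model.respond` interface are illustrative assumptions, not the published schema.

```python
# Minimal evaluation sketch for SocialOmni, assuming a typical benchmark
# schema on the released Hugging Face dataset. All field names, task labels,
# and the model.respond(...) interface are hypothetical placeholders.
from datasets import load_dataset

def evaluate(model, split: str = "test") -> dict:
    ds = load_dataset("alexisty/SocialOmni", split=split)
    n_perception, n_correct, generations = 0, 0, []
    for sample in ds:
        # Hypothetical task labels mirroring the paper's three dimensions:
        # speaker identification (who), interruption timing (when),
        # interruption generation (how).
        pred = model.respond(
            video=sample["video"],
            audio=sample["audio"],
            prompt=sample["question"],
        )
        if sample["task"] in ("speaker_identification", "interruption_timing"):
            n_perception += 1
            n_correct += int(pred.strip() == sample["answer"].strip())
        else:
            # Generation instances cannot be scored by exact match;
            # collect them for downstream human or LLM-based judging.
            generations.append({"id": sample["id"], "output": pred})
    return {
        "perception_accuracy": n_correct / max(n_perception, 1),
        "generations": generations,
    }
```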
Metadata
arXiv ID: 2603.16859v1 (https://arxiv.org/abs/2603.16859v1)
PDF: https://arxiv.org/pdf/2603.16859v1
Published: 2026-03-17
Primary category: cs.AI
Code: https://github.com/MAC-AutoML/SocialOmni
Dataset: https://huggingface.co/datasets/alexisty/SocialOmni