AI LLM March 23, 2026

Omni-WorldBench: Towards a Comprehensive Interaction-Centric Evaluation for World Models

Authors

Meiqi Wu, Zhixin Cai, Fufangchen Zhao, Xiaokun Feng, Rujing Dang, Bingze Song, Ruitian Tian, Jiashu Zhu, Jiachen Lei, Hao Dou, Jing Tang, Lei Sun, Jiahong Wu, Xiangxiang Chu, Zeming Liu, Kaiqi Huang

Abstract

Video-based world models have emerged along two dominant paradigms: video generation and 3D reconstruction. However, existing evaluation benchmarks either focus narrowly on visual fidelity and text-video alignment for generative models, or rely on static 3D reconstruction metrics that fundamentally neglect temporal dynamics. We argue that the future of world modeling lies in 4D generation, which jointly models spatial structure and temporal evolution. In this paradigm, the core capability is interactive response: the ability to faithfully reflect how interaction actions drive state transitions across space and time. Yet no existing benchmark systematically evaluates this critical dimension. To address this gap, we propose Omni-WorldBench, a comprehensive benchmark specifically designed to evaluate the interactive response capabilities of world models in 4D settings. Omni-WorldBench comprises two key components: Omni-WorldSuite, a systematic prompt suite spanning diverse interaction levels and scene types; and Omni-Metrics, an agent-based evaluation framework that quantifies world modeling capabilities by measuring the causal impact of interaction actions on both final outcomes and intermediate state evolution trajectories. We conduct extensive evaluations of 18 representative world models across multiple paradigms. Our analysis reveals critical limitations of current world models in interactive response, providing actionable insights for future research. Omni-WorldBench will be publicly released to foster progress in interactive 4D world modeling.

Metadata

arXiv ID: 2603.22212

Provider: ARXIV

Primary Category: cs.CV

Published: 2026-03-23

Fetched: 2026-03-24 06:02
