Research

Paper

AI LLM March 11, 2026

World2Act: Latent Action Post-Training via Skill-Compositional World Models

Authors

An Dinh Vuong, Tuan Van Vo, Abdullah Sohail, Haoran Ding, Liang Ma, Xiaodan Liang, Anqing Duan, Ivan Laptev, Ian Reid

Abstract

World Models (WMs) have emerged as a promising approach for post-training Vision-Language-Action (VLA) policies to improve robustness and generalization under environmental changes. However, most WM-based post-training methods rely on pixel-space supervision, making policies sensitive to pixel-level artifacts and hallucination from imperfect WM rollouts. We introduce World2Act, a post-training framework that aligns VLA actions directly with WM video-dynamics latents using a contrastive matching objective, reducing dependence on pixels. Post-training performance is tied to rollout quality, yet current WMs struggle with arbitrary-length video generation as they are mostly trained on fixed-length clips while robotic execution durations vary widely. To address this, we propose an automatic LLM-based skill-decomposition pipeline that segments high-level instructions into low-level prompts. Our pipeline produces RoboCasa-Skill and LIBERO-Skill, supporting skill-compositional WMs that remain temporally consistent across diverse task horizons. Empirically, applying World2Act to VLAs like GR00T-N1.6 and Cosmos Policy achieves state-of-the-art results on RoboCasa and LIBERO, and improves real-world performance by 6.7%, enhancing embodied agent generalization.

Metadata

arXiv ID: 2603.10422

Provider: ARXIV

Primary Category: cs.CV

Published: 2026-03-11

Fetched: 2026-03-12 04:21

Related papers

Gen-Searcher: Reinforcing Agentic Search for Image Generation

Kaituo Feng, Manyuan Zhang, Shuang Chen, Yunlong Lin, Kaixuan Fan, Yilei Jian... • 2026-03-30

On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers

Omer Dahary, Benaya Koren, Daniel Garibi, Daniel Cohen-Or • 2026-03-30

Graphilosophy: Graph-Based Digital Humanities Computing with The Four Books

Minh-Thu Do, Quynh-Chau Le-Tran, Duc-Duy Nguyen-Mai, Thien-Trang Nguyen, Khan... • 2026-03-30

ParaSpeechCLAP: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining

Anuj Diwan, Eunsol Choi, David Harwath • 2026-03-30

RAD-AI: Rethinking Architecture Documentation for AI-Augmented Ecosystems

Oliver Aleksander Larsen, Mahyar T. Moghaddam • 2026-03-30

Raw Data (Debug)

{
  "raw_xml": "<entry>\n    <id>http://arxiv.org/abs/2603.10422v1</id>\n    <title>World2Act: Latent Action Post-Training via Skill-Compositional World Models</title>\n    <updated>2026-03-11T05:11:44Z</updated>\n    <link href='https://arxiv.org/abs/2603.10422v1' rel='alternate' type='text/html'/>\n    <link href='https://arxiv.org/pdf/2603.10422v1' rel='related' title='pdf' type='application/pdf'/>\n    <summary>World Models (WMs) have emerged as a promising approach for post-training Vision-Language-Action (VLA) policies to improve robustness and generalization under environmental changes. However, most WM-based post-training methods rely on pixel-space supervision, making policies sensitive to pixel-level artifacts and hallucination from imperfect WM rollouts. We introduce World2Act, a post-training framework that aligns VLA actions directly with WM video-dynamics latents using a contrastive matching objective, reducing dependence on pixels. Post-training performance is tied to rollout quality, yet current WMs struggle with arbitrary-length video generation as they are mostly trained on fixed-length clips while robotic execution durations vary widely. To address this, we propose an automatic LLM-based skill-decomposition pipeline that segments high-level instructions into low-level prompts. Our pipeline produces RoboCasa-Skill and LIBERO-Skill, supporting skill-compositional WMs that remain temporally consistent across diverse task horizons. Empirically, applying World2Act to VLAs like GR00T-N1.6 and Cosmos Policy achieves state-of-the-art results on RoboCasa and LIBERO, and improves real-world performance by 6.7%, enhancing embodied agent generalization.</summary>\n    <category scheme='http://arxiv.org/schemas/atom' term='cs.CV'/>\n    <published>2026-03-11T05:11:44Z</published>\n    <arxiv:comment>Project page: https://wm2act.github.io/</arxiv:comment>\n    <arxiv:primary_category term='cs.CV'/>\n    <author>\n      <name>An Dinh Vuong</name>\n    </author>\n    <author>\n      <name>Tuan Van Vo</name>\n    </author>\n    <author>\n      <name>Abdullah Sohail</name>\n    </author>\n    <author>\n      <name>Haoran Ding</name>\n    </author>\n    <author>\n      <name>Liang Ma</name>\n    </author>\n    <author>\n      <name>Xiaodan Liang</name>\n    </author>\n    <author>\n      <name>Anqing Duan</name>\n    </author>\n    <author>\n      <name>Ivan Laptev</name>\n    </author>\n    <author>\n      <name>Ian Reid</name>\n    </author>\n  </entry>"
}