February 19, 2026

Computer-Using World Model

Authors

Yiming Guan, Rui Yu, John Zhang, Lu Wang, Chaoyun Zhang, Liqun Li, Bo Qiao, Si Qin, He Huang, Fangkai Yang, Pu Zhao, Lukas Wutschitz, Samuel Kessler, Huseyin A Inan, Robert Sim, Saravan Rajmohan, Qingwei Lin, Dongmei Zhang

Abstract

Agents operating in complex software environments benefit from reasoning about the consequences of their actions, as even a single incorrect user interface (UI) operation can derail long, artifact-preserving workflows. This challenge is particularly acute for computer-using scenarios, where real execution does not support counterfactual exploration, making large-scale trial-and-error learning and planning impractical despite the environment being fully digital and deterministic. We introduce the Computer-Using World Model (CUWM), a world model for desktop software that predicts the next UI state given the current state and a candidate action. CUWM adopts a two-stage factorization of UI dynamics: it first predicts a textual description of agent-relevant state changes, and then realizes these changes visually to synthesize the next screenshot. CUWM is trained on offline UI transitions collected from agents interacting with real Microsoft Office applications, and further refined with a lightweight reinforcement learning stage that aligns textual transition predictions with the structural requirements of computer-using environments. We evaluate CUWM via test-time action search, where a frozen agent uses the world model to simulate and compare candidate actions before execution. Across a range of Office tasks, world-model-guided test-time scaling improves decision quality and execution robustness.
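
The two-stage prediction and the test-time action search described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `UIState`, `predict_change`, `render_next`, `simulate`, `choose_action`, and `toy_score` are hypothetical names, the learned models are replaced by a hand-written rule table, and a short text summary stands in for the screenshot. The scoring rule is likewise an assumption made only so the example runs end to end.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class UIState:
    """Hypothetical stand-in for a screenshot: a plain-text summary of the UI."""
    description: str


def predict_change(state: UIState, action: str) -> str:
    """Stage 1 (assumed interface): describe the agent-relevant change in text.

    A toy rule table takes the place of a learned transition model.
    """
    if action == "click Bold":
        return "selected text becomes bold"
    if action == "click Close":
        return "document window closes without saving"
    return "no visible change"


def render_next(state: UIState, change: str) -> UIState:
    """Stage 2 (assumed interface): realize the textual change as the next state."""
    return UIState(description=f"{state.description}; {change}")


def simulate(state: UIState, action: str) -> Tuple[str, UIState]:
    """Two-stage prediction: textual change first, then the predicted next state."""
    change = predict_change(state, action)
    return change, render_next(state, change)


def toy_score(goal_phrase: str) -> Callable[[UIState], float]:
    """Build an illustrative scorer: reward the goal phrase, penalize data loss."""
    def score(state: UIState) -> float:
        value = 0.0
        if goal_phrase in state.description:
            value += 1.0
        if "without saving" in state.description:
            value -= 1.0
        return value
    return score


def choose_action(
    state: UIState,
    candidates: List[str],
    score: Callable[[UIState], float],
) -> str:
    """Simulate every candidate and return the one with the best predicted outcome.

    Nothing is executed in the real application; only predictions are compared.
    """
    if not candidates:
        raise ValueError("at least one candidate action is required")
    return max(candidates, key=lambda action: score(simulate(state, action)[1]))


if __name__ == "__main__":
    start = UIState("Word document with a selected heading")
    actions = ["click Close", "click Bold", "press Escape"]
    print(choose_action(start, actions, toy_score("bold")))  # prints "click Bold"
```

In this sketch the agent that proposes `actions` stays fixed, and only the selection step consults the simulated outcomes, which mirrors the frozen-agent setup the abstract mentions. A real system would score predicted screenshots rather than strings.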

Metadata

arXiv ID: 2602.17365

Provider: ARXIV

Primary Category: cs.SE

Published: 2026-02-19

Fetched: 2026-02-21 18:51

Comments: 35 pages, 7 figures
