February 25, 2026

WHOLE: World-Grounded Hand-Object Lifted from Egocentric Videos

Authors

Yufei Ye, Jiaman Li, Ryan Rong, C. Karen Liu

Abstract

Egocentric manipulation videos are highly challenging due to severe occlusions during interactions and frequent object entries and exits from the camera view as the person moves. Current methods typically focus on recovering either hand or object pose in isolation, but both struggle during interactions and fail to handle out-of-sight cases. Moreover, their independent predictions often lead to inconsistent hand-object relations. We introduce WHOLE, a method that holistically reconstructs hand and object motion in world space from egocentric videos given object templates. Our key insight is to learn a generative prior over hand-object motion to jointly reason about their interactions. At test time, the pretrained prior is guided to generate trajectories that conform to the video observations. This joint generative reconstruction substantially outperforms approaches that process hands and objects separately followed by post-processing. WHOLE achieves state-of-the-art performance on hand motion estimation, 6D object pose estimation, and their relative interaction reconstruction. Project website: https://judyye.github.io/whole-www

Metadata

arXiv ID: 2602.22209
Provider: ARXIV
Primary Category: cs.CV
Published: 2026-02-25
Fetched: 2026-02-26 05:00
