Research

Paper

TESTING March 03, 2026

LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory

Authors

Junyi Zhang, Charles Herrmann, Junhwa Hur, Chen Sun, Ming-Hsuan Yang, Forrester Cole, Trevor Darrell, Deqing Sun

Abstract

Feedforward geometric foundation models achieve strong short-window reconstruction, yet scaling them to minutes-long videos is bottlenecked by quadratic attention complexity or limited effective memory in recurrent designs. We present LoGeR (Long-context Geometric Reconstruction), a novel architecture that scales dense 3D reconstruction to extremely long sequences without post-optimization. LoGeR processes video streams in chunks, leveraging strong bidirectional priors for high-fidelity intra-chunk reasoning. To manage the critical challenge of coherence across chunk boundaries, we propose a learning-based hybrid memory module. This dual-component system combines a parametric Test-Time Training (TTT) memory to anchor the global coordinate frame and prevent scale drift, alongside a non-parametric Sliding Window Attention (SWA) mechanism to preserve uncompressed context for high-precision adjacent alignment. Remarkably, this memory architecture enables LoGeR to be trained on sequences of 128 frames, and generalize up to thousands of frames during inference. Evaluated across standard benchmarks and a newly repurposed VBR dataset with sequences of up to 19k frames, LoGeR substantially outperforms prior state-of-the-art feedforward methods--reducing ATE on KITTI by over 74%--and achieves robust, globally consistent reconstruction over unprecedented horizons.

Metadata

arXiv ID: 2603.03269

Provider: ARXIV

Primary Category: cs.CV

Published: 2026-03-03

Fetched: 2026-03-04 03:41

Related papers

Fractal universe and quantum gravity made simple

Fabio Briscese, Gianluca Calcagni • 2026-03-25

POLY-SIM: Polyglot Speaker Identification with Missing Modality Grand Challenge 2026 Evaluation Plan

Marta Moscati, Muhammad Saad Saeed, Marina Zanoni, Mubashir Noman, Rohan Kuma... • 2026-03-25

LensWalk: Agentic Video Understanding by Planning How You See in Videos

Keliang Li, Yansong Li, Hongze Shen, Mengdi Liu, Hong Chang, Shiguang Shan • 2026-03-25

Orientation Reconstruction of Proteins using Coulomb Explosions

Tomas André, Alfredo Bellisario, Nicusor Timneanu, Carl Caleman • 2026-03-25

The role of spatial context and multitask learning in the detection of organic and conventional farming systems based on Sentinel-2 time series

Jan Hemmerling, Marcel Schwieder, Philippe Rufin, Leon-Friedrich Thomas, Mire... • 2026-03-25

Raw Data (Debug)

{
  "raw_xml": "<entry>\n    <id>http://arxiv.org/abs/2603.03269v1</id>\n    <title>LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory</title>\n    <updated>2026-03-03T18:55:37Z</updated>\n    <link href='https://arxiv.org/abs/2603.03269v1' rel='alternate' type='text/html'/>\n    <link href='https://arxiv.org/pdf/2603.03269v1' rel='related' title='pdf' type='application/pdf'/>\n    <summary>Feedforward geometric foundation models achieve strong short-window reconstruction, yet scaling them to minutes-long videos is bottlenecked by quadratic attention complexity or limited effective memory in recurrent designs. We present LoGeR (Long-context Geometric Reconstruction), a novel architecture that scales dense 3D reconstruction to extremely long sequences without post-optimization. LoGeR processes video streams in chunks, leveraging strong bidirectional priors for high-fidelity intra-chunk reasoning. To manage the critical challenge of coherence across chunk boundaries, we propose a learning-based hybrid memory module. This dual-component system combines a parametric Test-Time Training (TTT) memory to anchor the global coordinate frame and prevent scale drift, alongside a non-parametric Sliding Window Attention (SWA) mechanism to preserve uncompressed context for high-precision adjacent alignment. Remarkably, this memory architecture enables LoGeR to be trained on sequences of 128 frames, and generalize up to thousands of frames during inference. Evaluated across standard benchmarks and a newly repurposed VBR dataset with sequences of up to 19k frames, LoGeR substantially outperforms prior state-of-the-art feedforward methods--reducing ATE on KITTI by over 74%--and achieves robust, globally consistent reconstruction over unprecedented horizons.</summary>\n    <category scheme='http://arxiv.org/schemas/atom' term='cs.CV'/>\n    <category scheme='http://arxiv.org/schemas/atom' term='cs.LG'/>\n    <published>2026-03-03T18:55:37Z</published>\n    <arxiv:comment>Project page: https://LoGeR-project.github.io/</arxiv:comment>\n    <arxiv:primary_category term='cs.CV'/>\n    <author>\n      <name>Junyi Zhang</name>\n    </author>\n    <author>\n      <name>Charles Herrmann</name>\n    </author>\n    <author>\n      <name>Junhwa Hur</name>\n    </author>\n    <author>\n      <name>Chen Sun</name>\n    </author>\n    <author>\n      <name>Ming-Hsuan Yang</name>\n    </author>\n    <author>\n      <name>Forrester Cole</name>\n    </author>\n    <author>\n      <name>Trevor Darrell</name>\n    </author>\n    <author>\n      <name>Deqing Sun</name>\n    </author>\n  </entry>"
}