Paper
VGG-T$^3$: Offline Feed-Forward 3D Reconstruction at Scale
Authors
Sven Elflein, Ruilong Li, Sérgio Agostinho, Zan Gojcic, Laura Leal-Taixé, Qunjie Zhou, Aljosa Osep
Abstract
We present a scalable 3D reconstruction model that addresses a critical limitation of offline feed-forward methods: their computational and memory requirements grow quadratically w.r.t. the number of input images. Our approach is built on the key insight that this bottleneck stems from the varying-length Key-Value (KV) space representation of scene geometry, which we distill into a fixed-size Multi-Layer Perceptron (MLP) via test-time training. VGG-T$^3$ (Visual Geometry Grounded Test Time Training) scales linearly w.r.t. the number of input views, similar to online models, and reconstructs a $1k$ image collection in just $54$ seconds, an $11.6\times$ speed-up over baselines that rely on softmax attention. Since our method retains global scene aggregation capability, it outperforms other linear-time methods in point map reconstruction error by large margins. Finally, we demonstrate the visual localization capabilities of our model by querying the scene representation with unseen images.
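The abstract's central mechanism, replacing a KV cache that grows with the number of views by a fixed-size MLP fitted at test time, can be pictured with a short PyTorch sketch. This is a minimal illustration under assumptions of our own (the class and function names, the MSE distillation objective, and all hyperparameters are hypothetical), not the paper's actual procedure:

# Hypothetical sketch: distill a growing KV cache into a fixed-size MLP at test time.
# Names (FastWeightMLP, distill_kv_into_mlp) and hyperparameters are illustrative, not from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FastWeightMLP(nn.Module):
    """Fixed-size MLP standing in for the variable-length KV memory."""
    def __init__(self, dim: int, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, q: torch.Tensor) -> torch.Tensor:
        return self.net(q)

def distill_kv_into_mlp(mlp, keys, values, queries, steps=10, lr=1e-2):
    """Fit the MLP so that mlp(q) approximates softmax attention over (keys, values).

    keys, values: (N, d) tensors from the current chunk of views.
    queries:      (M, d) probe queries used as distillation targets.
    """
    opt = torch.optim.SGD(mlp.parameters(), lr=lr)
    with torch.no_grad():
        # Exact softmax-attention outputs for this chunk serve as regression targets.
        attn = F.softmax(queries @ keys.t() / keys.shape[-1] ** 0.5, dim=-1)
        targets = attn @ values
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(mlp(queries), targets)
        loss.backward()
        opt.step()
    return mlp

# Toy usage: the MLP's parameter count stays fixed no matter how many view tokens arrive.
k, v = torch.randn(1000, 256), torch.randn(1000, 256)
q = torch.randn(128, 256)
mlp = distill_kv_into_mlp(FastWeightMLP(dim=256), k, v, q)

Because the memory is a fixed-size set of MLP weights rather than a token cache, cost per additional view stays constant, which is the source of the linear scaling claimed in the abstract.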
Metadata
arXiv ID: 2602.23361v1 (https://arxiv.org/abs/2602.23361v1, PDF: https://arxiv.org/pdf/2602.23361v1)
Published: 2026-02-26
Primary category: cs.CV
Comment: CVPR 2026, Project page: https://research.nvidia.com/labs/dvl/projects/vgg-ttt
Related papers
Fractal universe and quantum gravity made simple
Fabio Briscese, Gianluca Calcagni • 2026-03-25
POLY-SIM: Polyglot Speaker Identification with Missing Modality Grand Challenge 2026 Evaluation Plan
Marta Moscati, Muhammad Saad Saeed, Marina Zanoni, Mubashir Noman, Rohan Kuma... • 2026-03-25
LensWalk: Agentic Video Understanding by Planning How You See in Videos
Keliang Li, Yansong Li, Hongze Shen, Mengdi Liu, Hong Chang, Shiguang Shan • 2026-03-25
Orientation Reconstruction of Proteins using Coulomb Explosions
Tomas André, Alfredo Bellisario, Nicusor Timneanu, Carl Caleman • 2026-03-25
The role of spatial context and multitask learning in the detection of organic and conventional farming systems based on Sentinel-2 time series
Jan Hemmerling, Marcel Schwieder, Philippe Rufin, Leon-Friedrich Thomas, Mire... • 2026-03-25