CMMR-VLN: Vision-and-Language Navigation via Continual Multimodal Memory Retrieval

Authors

Haozhou Li, Xiangyu Dong, Huiyan Jiang, Yaoming Zhou, Xiaoguang Ma

Abstract

Although large language models (LLMs) have been introduced into vision-and-language navigation (VLN) to improve instruction comprehension and generalization, existing LLM-based VLN methods lack the ability to selectively recall and use relevant prior experiences during navigation, limiting their performance in long-horizon and unfamiliar scenarios. In this work, we propose CMMR-VLN (Continual Multimodal Memory Retrieval based VLN), a VLN framework that endows LLM agents with structured memory and reflection capabilities. Specifically, CMMR-VLN constructs a multimodal experience memory indexed by panoramic visual images and salient landmarks to retrieve relevant experiences during navigation, introduces a retrieval-augmented generation pipeline that mimics how experienced human navigators leverage prior knowledge, and incorporates a reflection-based memory update strategy that selectively stores complete successful paths and, in failure cases, the key initial mistake. Comprehensive tests show average success rate improvements of 52.9%, 20.9%, and 20.9% in simulation, and 200%, 50%, and 50% in real-world tests, over NavGPT, MapGPT, and DiscussNav, respectively, elucidating the potential of CMMR-VLN as a backbone VLN framework.
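The abstract names three mechanisms: a memory indexed jointly by panoramic views and salient landmarks, retrieval of similar past experiences into the generation step, and a reflective update that stores full successful paths but only the prefix up to the first mistake of failed ones. The paper's actual implementation is not given here; the following is a minimal Python sketch under assumed interfaces — `_hash_embed` is a deterministic stand-in for real visual/text encoders, and `_first_error_step` for the LLM reflection call.

```python
import zlib
from dataclasses import dataclass

import numpy as np


def _hash_embed(text: str, dim: int = 64) -> np.ndarray:
    """Deterministic stand-in for a real encoder (e.g. a CLIP-style model)."""
    rng = np.random.default_rng(zlib.crc32(text.encode()))
    return rng.standard_normal(dim)


@dataclass
class MemoryEntry:
    key: np.ndarray        # joint panorama + landmark embedding
    trajectory: list[str]  # per-step summaries of (observation, action)
    success: bool


class MultimodalExperienceMemory:
    """Experience memory indexed by panoramic views and salient landmarks."""

    def __init__(self) -> None:
        self.entries: list[MemoryEntry] = []

    def _key(self, panorama: str, landmarks: list[str]) -> np.ndarray:
        # Joint index: panorama embedding concatenated with the mean
        # embedding of the extracted landmark phrases.
        lm = [_hash_embed(l) for l in landmarks] or [np.zeros(64)]
        k = np.concatenate([_hash_embed(panorama), np.mean(lm, axis=0)])
        return k / np.linalg.norm(k)

    def retrieve(self, panorama: str, landmarks: list[str],
                 top_k: int = 3) -> list[MemoryEntry]:
        """Top-k most similar past experiences (cosine similarity over keys).
        The retrieved entries would be serialized into the LLM prompt,
        i.e. the retrieval-augmented generation step."""
        if not self.entries:
            return []
        q = self._key(panorama, landmarks)
        order = np.argsort([float(q @ e.key) for e in self.entries])[::-1]
        return [self.entries[i] for i in order[:top_k]]

    def reflect_and_store(self, panorama: str, landmarks: list[str],
                          trajectory: list[str], success: bool) -> None:
        """Reflection-based update: keep full successful paths; for
        failures, keep only the prefix ending at the first mistake."""
        kept = trajectory if success else trajectory[: _first_error_step(trajectory) + 1]
        self.entries.append(MemoryEntry(self._key(panorama, landmarks), kept, success))


def _first_error_step(trajectory: list[str]) -> int:
    # Stub reflection: a real agent would ask the LLM to locate the first
    # wrong action; here we assume failed steps are flagged in the log.
    return next((i for i, s in enumerate(trajectory) if "WRONG" in s),
                len(trajectory) - 1)
```

Storing only the initial mistake of a failed episode keeps the memory compact while still letting retrieval warn the agent away from the decision point where similar runs previously went wrong.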

Metadata

arXiv ID: 2603.07997
Provider: ARXIV
Primary Category: cs.AI
Published: 2026-03-09
Fetched: 2026-03-10 05:43
