CMMR-VLN: Vision-and-Language Navigation via Continual Multimodal Memory Retrieval

Authors

Haozhou Li, Xiangyu Dong, Huiyan Jiang, Yaoming Zhou, Xiaoguang Ma

Abstract

Although large language models (LLMs) have been introduced into vision-and-language navigation (VLN) to improve instruction comprehension and generalization, existing LLM-based VLN methods lack the ability to selectively recall and use relevant prior experiences during navigation, limiting their performance in long-horizon and unfamiliar scenarios. In this work, we propose CMMR-VLN (Continual Multimodal Memory Retrieval based VLN), a VLN framework that endows LLM agents with structured memory and reflection capabilities. Specifically, CMMR-VLN constructs a multimodal experience memory indexed by panoramic visual images and salient landmarks to retrieve relevant experiences during navigation, introduces a retrieval-augmented generation pipeline that mimics how experienced human navigators leverage prior knowledge, and incorporates a reflection-based memory update strategy that selectively stores complete successful paths and, in failure cases, the key initial mistake. Comprehensive tests show average success rate improvements of 52.9%, 20.9%, and 20.9% in simulation, and 200%, 50%, and 50% in real-world tests, over NavGPT, MapGPT, and DiscussNav, respectively, elucidating the potential of CMMR-VLN as a backbone VLN framework.
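The abstract names three mechanisms: a memory indexed jointly by panoramic views and salient landmarks, retrieval of similar past experiences into the generation step, and a reflective update that stores full successful paths but only the prefix up to the first mistake of failed ones. The paper's actual implementation is not given here; the following is a minimal Python sketch under assumed interfaces — `_hash_embed` is a deterministic stand-in for real visual/text encoders, and `_first_error_step` for the LLM reflection call.

```python
import zlib
from dataclasses import dataclass

import numpy as np


def _hash_embed(text: str, dim: int = 64) -> np.ndarray:
    """Deterministic stand-in for a real encoder (e.g. a CLIP-style model)."""
    rng = np.random.default_rng(zlib.crc32(text.encode()))
    return rng.standard_normal(dim)


@dataclass
class MemoryEntry:
    key: np.ndarray        # joint panorama + landmark embedding
    trajectory: list[str]  # per-step summaries of (observation, action)
    success: bool


class MultimodalExperienceMemory:
    """Experience memory indexed by panoramic views and salient landmarks."""

    def __init__(self) -> None:
        self.entries: list[MemoryEntry] = []

    def _key(self, panorama: str, landmarks: list[str]) -> np.ndarray:
        # Joint index: panorama embedding concatenated with the mean
        # embedding of the extracted landmark phrases.
        lm = [_hash_embed(l) for l in landmarks] or [np.zeros(64)]
        k = np.concatenate([_hash_embed(panorama), np.mean(lm, axis=0)])
        return k / np.linalg.norm(k)

    def retrieve(self, panorama: str, landmarks: list[str],
                 top_k: int = 3) -> list[MemoryEntry]:
        """Top-k most similar past experiences (cosine similarity over keys).
        The retrieved entries would be serialized into the LLM prompt,
        i.e. the retrieval-augmented generation step."""
        if not self.entries:
            return []
        q = self._key(panorama, landmarks)
        order = np.argsort([float(q @ e.key) for e in self.entries])[::-1]
        return [self.entries[i] for i in order[:top_k]]

    def reflect_and_store(self, panorama: str, landmarks: list[str],
                          trajectory: list[str], success: bool) -> None:
        """Reflection-based update: keep full successful paths; for
        failures, keep only the prefix ending at the first mistake."""
        kept = trajectory if success else trajectory[: _first_error_step(trajectory) + 1]
        self.entries.append(MemoryEntry(self._key(panorama, landmarks), kept, success))


def _first_error_step(trajectory: list[str]) -> int:
    # Stub reflection: a real agent would ask the LLM to locate the first
    # wrong action; here we assume failed steps are flagged in the log.
    return next((i for i, s in enumerate(trajectory) if "WRONG" in s),
                len(trajectory) - 1)
```

Storing only the initial mistake of a failed episode keeps the memory compact while still letting retrieval warn the agent away from the decision point where similar runs previously went wrong.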

Metadata

arXiv ID: 2603.07997
Provider: ARXIV
Primary Category: cs.AI
Published: 2026-03-09
Fetched: 2026-03-10 05:43
