February 26, 2026

WaterVideoQA: ASV-Centric Perception and Rule-Compliant Reasoning via Multi-Modal Agents

Authors

Runwei Guan, Shaofeng Liang, Ningwei Ouyang, Weichen Fei, Shanliang Yao, Wei Dai, Chenhao Ge, Penglei Sun, Xiaohui Zhu, Tao Huang, Ryan Wen Liu, Hui Xiong

Abstract

While autonomous navigation has achieved remarkable success in passive perception (e.g., object detection and segmentation), it remains fundamentally constrained by a void in knowledge-driven, interactive environmental cognition. In the high-stakes domain of maritime navigation, the ability to bridge the gap between raw visual perception and complex cognitive reasoning is not merely an enhancement but a critical prerequisite for Autonomous Surface Vessels (ASVs) to execute safe and precise maneuvers. To this end, we present WaterVideoQA, the first large-scale, comprehensive Video Question Answering (VideoQA) benchmark specifically engineered for all-waterway environments. This benchmark encompasses 3,029 video clips across six distinct waterway categories, integrating multifaceted variables such as volatile lighting and dynamic weather to rigorously stress-test ASV capabilities across a five-tier hierarchical cognitive framework. Furthermore, we introduce NaviMind, a pioneering multi-agent neuro-symbolic system designed for open-ended maritime reasoning. By synergizing Adaptive Semantic Routing, Situation-Aware Hierarchical Reasoning, and Autonomous Self-Reflective Verification, NaviMind transitions ASVs from superficial pattern matching to regulation-compliant, interpretable decision-making. Experimental results demonstrate that our framework significantly surpasses existing baselines, establishing a new paradigm for intelligent, trustworthy interaction in dynamic maritime environments.
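The abstract names three cooperating stages (Adaptive Semantic Routing, Situation-Aware Hierarchical Reasoning, Autonomous Self-Reflective Verification) but gives no implementation detail. The toy sketch below shows only the general shape of such a route-reason-verify agent pipeline; every function name, keyword list, and rule in it is an illustrative assumption, not NaviMind's actual method, and the single hard-coded rule merely stands in for a real COLREG-style rule base.

```python
# Illustrative route -> reason -> verify pipeline. All names, keywords,
# and rules here are hypothetical stand-ins, not the paper's implementation.

def route(question: str) -> str:
    """Routing sketch: dispatch a question to a specialist agent by keyword."""
    q = question.lower()
    if any(k in q for k in ("give way", "crossing", "overtake", "right of way")):
        return "rule_reasoner"
    if any(k in q for k in ("where", "color", "detect", "visible")):
        return "perception"
    return "general"

# Toy rule base standing in for COLREG-style navigation rules (illustrative).
RULES = {
    "crossing": ("When two power-driven vessels cross, the vessel which has "
                 "the other on her starboard side shall keep out of the way."),
}

def reason(agent: str, question: str) -> str:
    """Reasoning sketch: ground a rule-related answer in an explicit rule."""
    if agent == "rule_reasoner" and "crossing" in question.lower():
        return f"Give way. Basis: {RULES['crossing']}"
    return "Answer requires visual grounding (not modeled in this sketch)."

def verify(answer: str) -> bool:
    """Verification sketch: accept only answers that cite a rule basis."""
    return "Basis:" in answer

def pipeline(question: str) -> str:
    agent = route(question)
    answer = reason(agent, question)
    return answer if verify(answer) else "UNVERIFIED: " + answer
```

The point of the structure is that the verifier rejects any answer not explicitly grounded in the rule base, which is one plausible way to obtain the "regulation-compliant, interpretable" behavior the abstract claims; the paper itself should be consulted for the real mechanism.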

Metadata

arXiv ID: 2602.22923
Provider: ARXIV
Primary Category: cs.CV
Categories: cs.CV, cs.RO
Comment: 11 pages, 8 figures
PDF: https://arxiv.org/pdf/2602.22923v1
Published: 2026-02-26
Fetched: 2026-02-27 04:35
