Paper
Agentic Mixed-Source Multi-Modal Misinformation Detection with Adaptive Test-Time Scaling
Authors
Wei Jiang, Tong Chen, Wei Yuan, Quoc Viet Hung Nguyen, Hongzhi Yin
Abstract
Vision-language models (VLMs) have been proven effective for detecting multi-modal misinformation on social platforms, especially in zero-shot settings with unavailable or delayed annotations. However, a single VLM's capacity falls short in the more complex mixed-source multi-modal misinformation detection (M3D) task. Taking captioned images as an example, in M3D, false information can originate from untruthful texts, forged images, or mismatches between the two modalities. Although recent agentic systems can handle zero-shot M3D by connecting modality-specific VLM agents, their effectiveness is still bottlenecked by their architecture. In existing agentic M3D solutions, for any input sample, each agent performs only one forward reasoning pass, making decisions prone to model randomness and reasoning errors in challenging cases. Moreover, the lack of exploration over alternative reasoning paths prevents modern VLMs from fully utilizing their reasoning capacity. In this work, we present AgentM3D, a multi-agent framework for zero-shot M3D. To amplify the reasoning capability of VLMs, we introduce an adaptive test-time scaling paradigm in which each modality-specific VLM agent applies a Best-of-N mechanism, coupled with a critic agent for task-aligned scoring. The agents are organized in a cascading, modality-specific decision chain to reduce unnecessary computation and limit error propagation. To ensure scalability, a planning agent dynamically determines the maximum number of reasoning paths based on sample difficulty, and an adaptive stopping mechanism prevents excessive reasoning within each agent. Extensive experiments on two M3D benchmarks demonstrate that AgentM3D achieves state-of-the-art zero-shot detection performance compared with various VLM-based and agentic baselines.
Metadata
Related papers
Fractal universe and quantum gravity made simple
Fabio Briscese, Gianluca Calcagni • 2026-03-25
POLY-SIM: Polyglot Speaker Identification with Missing Modality Grand Challenge 2026 Evaluation Plan
Marta Moscati, Muhammad Saad Saeed, Marina Zanoni, Mubashir Noman, Rohan Kuma... • 2026-03-25
LensWalk: Agentic Video Understanding by Planning How You See in Videos
Keliang Li, Yansong Li, Hongze Shen, Mengdi Liu, Hong Chang, Shiguang Shan • 2026-03-25
Orientation Reconstruction of Proteins using Coulomb Explosions
Tomas André, Alfredo Bellisario, Nicusor Timneanu, Carl Caleman • 2026-03-25
The role of spatial context and multitask learning in the detection of organic and conventional farming systems based on Sentinel-2 time series
Jan Hemmerling, Marcel Schwieder, Philippe Rufin, Leon-Friedrich Thomas, Mire... • 2026-03-25
Raw Data (Debug)
{
"raw_xml": "<entry>\n <id>http://arxiv.org/abs/2603.02519v1</id>\n <title>Agentic Mixed-Source Multi-Modal Misinformation Detection with Adaptive Test-Time Scaling</title>\n <updated>2026-03-03T02:07:52Z</updated>\n <link href='https://arxiv.org/abs/2603.02519v1' rel='alternate' type='text/html'/>\n <link href='https://arxiv.org/pdf/2603.02519v1' rel='related' title='pdf' type='application/pdf'/>\n <summary>Vision-language models (VLMs) have been proven effective for detecting multi-modal misinformation on social platforms, especially in zero-shot settings with unavailable or delayed annotations. However, a single VLM's capacity falls short in the more complex mixed-source multi-modal misinformation detection (M3D) task. Taking captioned images as an example, in M3D, false information can originate from untruthful texts, forged images, or mismatches between the two modalities. Although recent agentic systems can handle zero-shot M3D by connecting modality-specific VLM agents, their effectiveness is still bottlenecked by their architecture. In existing agentic M3D solutions, for any input sample, each agent performs only one forward reasoning pass, making decisions prone to model randomness and reasoning errors in challenging cases. Moreover, the lack of exploration over alternative reasoning paths prevents modern VLMs from fully utilizing their reasoning capacity. In this work, we present AgentM3D, a multi-agent framework for zero-shot M3D. To amplify the reasoning capability of VLMs, we introduce an adaptive test-time scaling paradigm in which each modality-specific VLM agent applies a Best-of-N mechanism, coupled with a critic agent for task-aligned scoring. The agents are organized in a cascading, modality-specific decision chain to reduce unnecessary computation and limit error propagation. To ensure scalability, a planning agent dynamically determines the maximum number of reasoning paths based on sample difficulty, and an adaptive stopping mechanism prevents excessive reasoning within each agent. Extensive experiments on two M3D benchmarks demonstrate that AgentM3D achieves state-of-the-art zero-shot detection performance compared with various VLM-based and agentic baselines.</summary>\n <category scheme='http://arxiv.org/schemas/atom' term='cs.MM'/>\n <category scheme='http://arxiv.org/schemas/atom' term='cs.IR'/>\n <published>2026-03-03T02:07:52Z</published>\n <arxiv:primary_category term='cs.MM'/>\n <author>\n <name>Wei Jiang</name>\n </author>\n <author>\n <name>Tong Chen</name>\n </author>\n <author>\n <name>Wei Yuan</name>\n </author>\n <author>\n <name>Quoc Viet Hung Nguyen</name>\n </author>\n <author>\n <name>Hongzhi Yin</name>\n </author>\n </entry>"
}