Research

Papers

Research papers from arXiv and related sources

Total: 4694, AI/LLM: 2583, Testing: 2111
AI LLM

Try, Check and Retry: A Divide-and-Conquer Framework for Boosting Long-context Tool-Calling Performance of LLMs

Tool-calling empowers Large Language Models (LLMs) to interact with external environments. However, current methods often struggle to handle massive and noisy candidate tools in long-context tool-c...

Kunfeng Chen, Qihuang Zhong, Juhua Liu, Bo Du, Dacheng Tao

2603.11495 2026-03-12
AI LLM

PRMB: Benchmarking Reward Models in Long-Horizon CBT-based Counseling Dialogue

Large language models (LLMs) hold potential for mental healthcare applications, particularly in cognitive behavioral therapy (CBT)-based counseling, where reward models play a critical role in alig...

Yougen Zhou, Qin Chen, Ningning Zhou, Jie Zhou, Liang He

2603.11494 2026-03-12
TESTING

SPEGC: Continual Test-Time Adaptation via Semantic-Prompt-Enhanced Graph Clustering for Medical Image Segmentation

In medical image segmentation tasks, the domain gap caused by the difference in data collection between training and testing data seriously hinders the deployment of pre-trained models in clinical ...

Xiaogang Du, Jiawei Zhang, Tongfei Liu, Tao Lei, Yingbo Wang

2603.11492 2026-03-12
TESTING

AutoVeriFix+: High-Correctness RTL Generation via Trace-Aware Causal Fix and Semantic Redundancy Pruning

Large language models (LLMs) have demonstrated impressive capabilities in generating software code for high-level programming languages such as Python and C++. However, their application to hardwar...

Yan Tan, Xiangchen Meng, Zijun Jiang, Yangdi Lyu

2603.11489 2026-03-12
TESTING

Quantized Inference for OneRec-V2

Quantized inference has demonstrated substantial system-level benefits in large language models while preserving model quality. In contrast, reliably applying low-precision quantization to recommen...

Yi Su, Xinchen Luo, Hongtao Cheng, Ziteng Shu, Yunfeng Zhao, Fangyu Zhang, Jiaqiang Liu, Xiao Lia...

2603.11486 2026-03-12
AI LLM

INFACT: A Diagnostic Benchmark for Induced Faithfulness and Factuality Hallucinations in Video-LLMs

Despite rapid progress, Video Large Language Models (Video-LLMs) remain unreliable due to hallucinations, which are outputs that contradict either video evidence (faithfulness) or verifiable world ...

Junqi Yang, Yuecong Min, Jie Zhang, Shiguang Shan, Xilin Chen

2603.11481 2026-03-12
AI LLM

Grammar of the Wave: Towards Explainable Multivariate Time Series Event Detection via Neuro-Symbolic VLM Agents

Time Series Event Detection (TSED) has long been an important task with critical applications across many high-stakes domains. Unlike statistical anomalies, events are defined by semantics with com...

Sky Chenwei Wan, Tianjun Hou, Yifei Wang, Xiqing Chang, Aymeric Jan

2603.11479 2026-03-12
TESTING

Graph Generation Methods under Partial Information

We study the problem of generating graphs with prescribed degree sequences for bipartite, directed, and undirected networks. We first propose a sequential method for bipartite graph generation and ...

Tong Sun, Jianshu Hao, Michael C. Fu, Guangxin Jiang

2603.11478 2026-03-12
TESTING

Leveraging Phytolith Research using Artificial Intelligence

Phytolith analysis is a crucial tool for reconstructing past vegetation and human activities, but traditional methods are severely limited by labour-intensive, time-consuming manual microscopy. To ...

Andrés G. Mejía Ramón, Kate Dudgeon, Nina Witteveen, Dolores Piperno, Michael Kloster, Luigi Palo...

2603.11476 2026-03-12
AI LLM

Deep Learning Network-Temporal Models For Traffic Prediction

Time series analysis is critical for emerging network intelligent control and management functions. However, existing statistical-based and shallow machine learning models have shown limited pred...

Yufeng Xin, Ethan Fan

2603.11475 2026-03-12
TESTING

Stochastic Optimization and Coupling

We study optimization problems in which a linear functional is maximized over probability measures that are dominated by a given measure according to an integral stochastic order in an arbitrary di...

Frank Yang, Kai Hao Yang

2603.11448 2026-03-12
TESTING

Verified Multi-Agent Orchestration: A Plan-Execute-Verify-Replan Framework for Complex Query Resolution

We present Verified Multi-Agent Orchestration (VMAO), a framework that coordinates specialized LLM-based agents through a verification-driven iterative loop. Given a complex query, our system decom...

Xing Zhang, Yanwei Cui, Guanghui Wang, Qucy Wei Qiu, Ziyuan Li, Fangwei Han, Yajing Huang, Hengzh...

2603.11445 2026-03-12
TESTING

NCCLbpf: Verified, Composable Policy Execution for GPU Collective Communication

NCCL is the de facto standard for collective GPU communication in large-scale distributed training, relying heavily on plugins to customize runtime behavior. However, these plugins execute as unver...

Yusheng Zheng

2603.11438 2026-03-12
TESTING

ZTab: Domain-based Zero-shot Annotation for Table Columns

This study addresses the challenge of automatically detecting semantic column types in relational tables, a key task in many real-world applications. Zero-shot modeling eliminates the need for user...

Ehsan Hoseinzade, Ke Wang

2603.11436 2026-03-12
TESTING

Grounding Robot Generalization in Training Data via Retrieval-Augmented VLMs

Recent work on robot manipulation has advanced policy generalization to novel scenarios. However, it is often difficult to characterize how different evaluation settings actually represent generali...

Jensen Gao, Dorsa Sadigh, Sandy Huang, Dhruv Shah

2603.11426 2026-03-12
TESTING

Zero-Shot Cross-City Generalization in End-to-End Autonomous Driving: Self-Supervised versus Supervised Representations

End-to-end autonomous driving models are typically trained on multi-city datasets using supervised ImageNet-pretrained backbones, yet their ability to generalize to unseen cities remains largely un...

Fatemeh Naeinian, Ali Hamza, Haoran Zhu, Anna Choromanska

2603.11417 2026-03-12
TESTING

Evaluation format, not model capability, drives triage failure in the assessment of consumer health AI

Ramaswamy et al. reported in Nature Medicine that ChatGPT Health under-triages 51.6% of emergencies, concluding that consumer-facing AI triage poses safety risks. However, their evaluatio...

David Fraile Navarro, Farah Magrabi, Enrico Coiera

2603.11413 2026-03-12
TESTING

Reproducible Synthetic Clinical Letters for Seizure Frequency Information Extraction

Seizure-frequency information is important for epilepsy research and clinical care, but it is usually recorded in variable free-text clinic letters that are hard to annotate and share. We developed...

Yujian Gan, Stephen H. Barlow, Ben Holgate, Joe Davies, James T. Teo, Joel S. Winston, Mark P. Ri...

2603.11407 2026-03-12
TESTING

Vision-Based Hand Shadowing for Robotic Manipulation via Inverse Kinematics

Teleoperation of low-cost robotic manipulators remains challenging due to the complexity of mapping human hand articulations to robot joint commands. We present an offline hand-shadowing and retarg...

Hendrik Chiche, Antoine Jamme, Trevor Rigoberto Martinez

2603.11383 2026-03-11
TESTING

Detecting Intrinsic and Instrumental Self-Preservation in Autonomous Agents: The Unified Continuation-Interest Protocol

Autonomous agents, especially delegated systems with memory, persistent context, and multi-step planning, pose a measurement problem not present in stateless models: an agent that preserves continu...

Christopher Altman

2603.11382 2026-03-11