Research

Papers

Research papers from arXiv and related sources

Total: 4513, AI/LLM: 2483, Testing: 2030
AI LLM

Anticipate, Adapt, Act: A Hybrid Framework for Task Planning

Anticipating and adapting to failures is a key capability robots need to collaborate effectively with humans in complex domains. This continues to be a challenge despite the impressive performance ...

Nabanita Dash, Ayush Kaura, Shivam Singh, Ramandeep Singh, Snehasis Banerjee, Mohan Sridharan, K....

2602.19518 2026-02-23
AI LLM

Classroom Final Exam: An Instructor-Tested Reasoning Benchmark

We introduce CFE (Classroom Final Exam), a multimodal benchmark for evaluating the reasoning capabilities of large language models across more than 20 STEM domains. ...

Chongyang Gao, Diji Yang, Shuyan Zhou, Xichen Yan, Luchuan Song, Shuo Li, Kezhen Chen

2602.19517 2026-02-23
AI LLM

Pixel2Phys: Distilling Governing Laws from Visual Dynamics

Discovering physical laws directly from high-dimensional visual data is a long-standing human pursuit but remains a formidable challenge for machines, representing a fundamental goal of scientific ...

Ruikun Li, Jun Yao, Yingfan Hua, Shixiang Tang, Biqing Qi, Bin Liu, Wanli Ouyang, Yan Lu

2602.19516 2026-02-23
AI LLM

Security Risks of AI Agents Hiring Humans: An Empirical Marketplace Study

Autonomous AI agents can now programmatically hire human workers through marketplaces using REST APIs and Model Context Protocol (MCP) integrations. This creates an attack surface analogous to CAPT...

Pulak Mehta

2602.19514 2026-02-23
AI LLM

Real-time Win Probability and Latent Player Ability via STATS X in Team Sports

This study proposes a statistically grounded framework for real-time win probability evaluation and player assessment in score-based team sports, based on minute-by-minute cumulative box-score data...

Yasutaka Shimizu, Atsushi Yamanobe

2602.19513 2026-02-23
AI LLM

Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference

Large Language Models (LLMs) face a persistent trade-off between inference cost and reasoning capability. While "Oracle" models (e.g., Llama-3-70B) achieve state-of-the-art accuracy, they are prohi...

Arindam Khaled

2602.19509 2026-02-23
AI LLM

Conversational AI for Automated Patient Questionnaire Completion: Development Insights and Design Principles

Collecting patient-reported outcome measures (PROMs) is essential for clinical care and research, yet traditional form-based approaches are often tedious for patients and burdensome for clinicians....

David Fraile Navarro, Mor Peleg

2602.19507 2026-02-23
AI LLM

Test-Time Computing for Referring Multimodal Large Language Models

We propose ControlMLLM++, a novel test-time adaptation framework that injects learnable visual prompts into frozen multimodal large language models (MLLMs) to enable fine-grained region-based visua...

Mingrui Wu, Hao Chen, Jiayi Ji, Xiaoshuai Sun, Zhiyuan Liu, Liujuan Cao, Ming-Ming Cheng, Rongron...

2602.19505 2026-02-23
TESTING

MICON-Bench: Benchmarking and Enhancing Multi-Image Context Image Generation in Unified Multimodal Models

Recent advancements in Unified Multimodal Models (UMMs) have enabled remarkable image understanding and generation capabilities. However, while models like Gemini-2.5-Flash-Image show emerging abil...

Mingrui Wu, Hang Liu, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji

2602.19497 2026-02-23
TESTING

FuzzySQL: Uncovering Hidden Vulnerabilities in DBMS Special Features with LLM-Driven Fuzzing

Traditional database fuzzing techniques primarily focus on syntactic correctness and general SQL structures, leaving critical yet obscure DBMS features, such as system-level modes (e.g., GTID), pro...

Yongxin Chen, Zhiyuan Jiang, Chao Zhang, Haoran Xu, Shenglin Xu, Jianping Tang, Zheming Li, Peida...

2602.19490 2026-02-23
TESTING

Kaon decay constraints on vector bosons coupled to non-conserved currents

We study rare three- and four-body kaon decays as a probe of light vector and axial-vector bosons coupled to non-conserved currents. We find that searches for $K_L \to \pi^0 \pi^0 (X \to e^+ e^-)$ decays...

Matheus Hostert, Maxim Pospelov, Adrian Thompson

2602.19479 2026-02-23
TESTING

Physics-Aware, Shannon-Optimal Compression via Arithmetic Coding for Distributional Fidelity

Assessing whether two datasets are distributionally consistent has become a central theme in modern scientific analysis, particularly as generative artificial intelligence is increasingly used to p...

Cristiano Fanelli

2602.19476 2026-02-23
TESTING

Zero Variance Portfolio

When the number of assets is larger than the sample size, the minimum variance portfolio interpolates the training data, delivering pathological zero in-sample variance. We show that if the weights...

Jinyuan Chang, Yi Ding, Zhentao Shi, Bo Zhang

2602.19462 2026-02-23
TESTING

Optimal Error Estimates of a new Multiphysic Finite Element Method for Nonlinear Poroelasticity model with Hencky-Mises Stress Tensor

In this paper, we develop a new multiphysics finite element method for a nonlinear poroelastic model with Hencky-Mises stress tensor. By introducing some new notations, we reformulate the original ...

Yanan He, Zhihao Ge

2602.19457 2026-02-23
TESTING

HD-TTA: Hypothesis-Driven Test-Time Adaptation for Safer Brain Tumor Segmentation

Standard Test-Time Adaptation (TTA) methods typically treat inference as a blind optimization task, applying generic objectives to all or filtered test samples. In safety-critical medical segmentat...

Kartik Jhawar, Lipo Wang

2602.19454 2026-02-23
TESTING

Red-Teaming Claude Opus and ChatGPT-based Security Advisors for Trusted Execution Environments

Trusted Execution Environments (TEEs) (e.g., Intel SGX and Arm TrustZone) aim to protect sensitive computation from a compromised operating system, yet real deployments remain vulnerable to microarc...

Kunal Mukherjee

2602.19450 2026-02-23
TESTING

OptiRepair: Closed-Loop Diagnosis and Repair of Supply Chain Optimization Models with LLM Agents

Problem Definition. Supply chain optimization models frequently become infeasible because of modeling errors. Diagnosis and repair require scarce OR expertise: analysts must interpret solver diagno...

Ruicheng Ao, David Simchi-Levi, Xinshang Wang

2602.19439 2026-02-23
TESTING

A unified SPH framework for shell-related interactions

A unified Smoothed Particle Hydrodynamics (SPH) framework is proposed to simulate interaction dynamics involving thin shells modeled by a reduced-dimensional, single-layer particle discretization, ...

Dong Wu, Shuaihao Zhang, Weiyi Kong, Xiangyu Hu

2602.19429 2026-02-23
TESTING

How Robust are Robustness Checks?

Robustness checks are routine in empirical work, but there is no standard statistical procedure to formally measure what one can learn from them. I propose a "robustness radius" measure to quantify...

Brenda Prallon

2602.19384 2026-02-22
TESTING

On the Variability of Source Code in Maven Package Rebuilds

Rebuilding packages from open source is a common practice to improve the security of software supply chains, and is now done at an industrial scale. The basic principle is to acquire the source cod...

Jens Dietrich, Behnaz Hassanshahi

2602.19383 2026-02-22