Research

Papers

Research papers from arXiv and related sources

Total: 4513, AI/LLM: 2483, Testing: 2030
AI LLM

Learning to Disprove: Formal Counterexample Generation with Large Language Models

Mathematical reasoning demands two critical, complementary skills: constructing rigorous proofs for true statements and discovering counterexamples that disprove false ones. However, current AI eff...

Zenan Li, Zhaoyu Li, Kaiyu Yang, Xiaoxing Ma, Zhendong Su

2603.19514 2026-03-19
AI LLM

FedAgain: A Trust-Based and Robust Federated Learning Strategy for an Automated Kidney Stone Identification in Ureteroscopy

The reliability of artificial intelligence (AI) in medical imaging critically depends on its robustness to heterogeneous and corrupted images acquired with diverse devices across different hospital...

Ivan Reyes-Amezcua, Francisco Lopez-Tiro, Clément Larose, Christian Daul, Andres Mendez-Vazquez, ...

2603.19512 2026-03-19
AI LLM

AI-Ready Control System for the Fermilab Accelerator Complex

Reliable, high-intensity operation of the Fermilab Accelerator Complex is critical to the success of the Long-Baseline Neutrino Facility and Deep Underground Neutrino Experiment. We describe the re...

Tia Miceli, Erik Gottschalk, Donovan Tooke, Evan Milton, Robert Santucci, Hayden Hoschouer, Micha...

2603.19507 2026-03-19
AI LLM

Beyond the Desk: Barriers and Future Opportunities for AI to Assist Scientists in Embodied Physical Tasks

More scientists are now using AI, but prior studies have examined only how they use it 'at the desk' for computer-based work. However, given that scientific work often happens 'beyond the desk' at ...

Irene Hou, Alexander Qin, Lauren Cheng, Philip J. Guo

2603.19504 2026-03-19
TESTING

A Lanczos-based algorithm for sum-over-states calculations of NMR spin-spin coupling constants at the RPA level of theory: The Fermi-contact term

The analysis of nuclear magnetic resonance parameters, such as the indirect nuclear spin-spin coupling constants, in terms of contributions from localised molecular orbitals is a commonly used appr...

Sarah L. V. Zahn, Luna Zamok, Sonia Coriani, Stephan P. A. Sauer

2603.19498 2026-03-19
TESTING

Narrative Aligned Long Form Video Question Answering

Recent progress in multimodal large language models (MLLMs) has led to a surge of benchmarks for long-video reasoning. However, most existing benchmarks rely on localized cues and fail to capture n...

Rahul Jain, Keval Doshi, Burak Uzkent, Garin Kessler

2603.19481 2026-03-19
AI LLM

Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL

Off-policy problems, such as policy staleness and training-inference mismatch, have become a major bottleneck for training stability and further exploration in LLM RL. To enhance inference efficienc...

Chenlu Ye, Xuanchang Zhang, Yifan Hao, Zhou Yu, Ziji Zhang, Abhinav Gullapalli, Hao Chen, Jing Hu...

2603.19470 2026-03-19
AI LLM

A Framework for Formalizing LLM Agent Security

Security in LLM agents is inherently contextual. For example, the same action taken by an agent may represent legitimate behavior or a security violation depending on whose instruction led to the a...

Vincent Siu, Jingxuan He, Kyle Montgomery, Zhun Wang, Neil Gong, Chenguang Wang, Dawn Song

2603.19469 2026-03-19
TESTING

Listen First, Then Answer: Timestamp-Grounded Speech Reasoning

Large audio-language models (LALMs) can generate reasoning chains for their predictions, but it remains unclear whether these reasoning chains remain grounded in the input audio. In this paper, we ...

Jihoon Jeong, Pooneh Mousavi, Mirco Ravanelli, Cem Subakan

2603.19468 2026-03-19
TESTING

ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models

Effective collaboration begins with knowing when to ask for help. For example, when trying to identify an occluded object, a human would ask someone to remove the obstruction. Can MLLMs exhibit a s...

Thomas De Min, Subhankar Roy, Stéphane Lathuilière, Elisa Ricci, Massimiliano Mancini

2603.19466 2026-03-19
AI LLM

Global Convergence of Multiplicative Updates for the Matrix Mechanism: A Collaborative Proof with Gemini 3

We analyze a fixed-point iteration $v \leftarrow φ(v)$ arising in the optimization of a regularized nuclear norm objective involving the Hadamard product structure, posed in \cite{denisov} in the c...

Keith Rush

2603.19465 2026-03-19
AI LLM

Can LLMs Prove Robotic Path Planning Optimality? A Benchmark for Research-Level Algorithm Verification

Robotic path planning problems are often NP-hard, and practical solutions typically rely on approximation algorithms with provable performance guarantees for general cases. While designing such alg...

Zhengbang Yang, Md. Tasin Tazwar, Minghan Wei, Zhuangdi Zhu

2603.19464 2026-03-19
AI LLM

Hyperagents

Self-improving AI systems aim to reduce reliance on human engineering by learning to improve their own learning and problem-solving processes. Existing approaches to self-improvement rely on fixed,...

Jenny Zhang, Bingchen Zhao, Wannan Yang, Jakob Foerster, Jeff Clune, Minqi Jiang, Sam Devlin, Tat...

2603.19461 2026-03-19
TESTING

Computer-Orchestrated Design of Algorithms: From Join Specification to Implementation

Equipping query processing systems with provable theoretical guarantees has been a central focus at the intersection of database theory and systems in recent years. However, the divergence between ...

Zeyuan Hu

2603.19434 2026-03-19
TESTING

Is Evaluation Awareness Just Format Sensitivity? Limitations of Probe-Based Evidence under Controlled Prompt Structure

Prior work uses linear probes on benchmark prompts as evidence of evaluation awareness in large language models. Because evaluation context is typically entangled with benchmark format and genre, i...

Viliana Devbunova

2603.19426 2026-03-19
TESTING

Investigating In-Context Privacy Learning by Integrating User-Facing Privacy Tools into Conversational Agents

Supporting users in protecting sensitive information when using conversational agents (CAs) is crucial, as users may undervalue privacy protection due to outdated, partial, or inaccurate knowledge ...

Mohammad Hadi Nezhad, Francisco Enrique Vicente Castro, Ivon Arroyo

2603.19416 2026-03-19
TESTING

Ringdown modeling for effective-one-body waveforms in the test-mass limit for eccentric equatorial orbits around a Kerr black hole

We study the plunge and merger of a non-spinning particle falling into a Kerr black hole following an eccentric planar inspiral. The dynamics is driven by an effective-one-body radiation reaction, ...

Simone Albanesi, Sebastiano Bernuzzi, Alessandro Nagar

2603.19413 2026-03-19
TESTING

DePro: Understanding the Role of LLMs in Debugging Competitive Programming Code

Debugging consumes a substantial portion of the software development lifecycle, yet the effectiveness of Large Language Models (LLMs) in this task is not well understood. Competitive programming off...

Nabiha Parvez, Tanvin Sarkar Pallab, Mia Mohammad Imran, Tarannum Shaila Zaman

2603.19399 2026-03-19
TESTING

Optimizing Resource-Constrained Non-Pharmaceutical Interventions for Multi-Cluster Outbreak Control Using Hierarchical Reinforcement Learning

Non-pharmaceutical interventions (NPIs), such as diagnostic testing and quarantine, are crucial for controlling infectious disease outbreaks but are often constrained by limited resources, particul...

Xueqiao Peng, Andrew Perrault

2603.19397 2026-03-19
TESTING

Understanding Bell locality tests at colliders

For decades, it has been known that local hidden variable theories cannot be disproved by collider experiments involving decaying particles. However, if these theories satisfy a small set of mild a...

J. A. Aguilar-Saavedra, J. A. Casas, J. M. Moreno

2603.19389 2026-03-19