Papers
Research papers from arXiv and related sources
Learning to Disprove: Formal Counterexample Generation with Large Language Models
Mathematical reasoning demands two critical, complementary skills: constructing rigorous proofs for true statements and discovering counterexamples that disprove false ones. However, current AI eff...
Zenan Li, Zhaoyu Li, Kaiyu Yang, Xiaoxing Ma, Zhendong Su
FedAgain: A Trust-Based and Robust Federated Learning Strategy for an Automated Kidney Stone Identification in Ureteroscopy
The reliability of artificial intelligence (AI) in medical imaging critically depends on its robustness to heterogeneous and corrupted images acquired with diverse devices across different hospital...
Ivan Reyes-Amezcua, Francisco Lopez-Tiro, Clément Larose, Christian Daul, Andres Mendez-Vazquez, ...
AI-Ready Control System for the Fermilab Accelerator Complex
Reliable, high-intensity operation of the Fermilab Accelerator Complex is critical to the success of the Long-Baseline Neutrino Facility and Deep Underground Neutrino Experiment. We describe the re...
Tia Miceli, Erik Gottschalk, Donovan Tooke, Evan Milton, Robert Santucci, Hayden Hoschouer, Micha...
Beyond the Desk: Barriers and Future Opportunities for AI to Assist Scientists in Embodied Physical Tasks
More scientists are now using AI, but prior studies have examined only how they use it 'at the desk' for computer-based work. However, given that scientific work often happens 'beyond the desk' at ...
Irene Hou, Alexander Qin, Lauren Cheng, Philip J. Guo
A Lanczos-based algorithm for sum-over-states calculations of NMR spin-spin coupling constants at the RPA level of theory: The Fermi-contact term
The analysis of nuclear magnetic resonance parameters, such as the indirect nuclear spin-spin coupling constants, in terms of contributions from localised molecular orbitals is a commonly used appr...
Sarah L. V. Zahn, Luna Zamok, Sonia Coriani, Stephan P. A. Sauer
Narrative Aligned Long Form Video Question Answering
Recent progress in multimodal large language models (MLLMs) has led to a surge of benchmarks for long-video reasoning. However, most existing benchmarks rely on localized cues and fail to capture n...
Rahul Jain, Keval Doshi, Burak Uzkent, Garin Kessler
Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL
Off-policy problems, such as policy staleness and training-inference mismatch, have become a major bottleneck for training stability and further exploration for LLM RL. To enhance inference efficienc...
Chenlu Ye, Xuanchang Zhang, Yifan Hao, Zhou Yu, Ziji Zhang, Abhinav Gullapalli, Hao Chen, Jing Hu...
A Framework for Formalizing LLM Agent Security
Security in LLM agents is inherently contextual. For example, the same action taken by an agent may represent legitimate behavior or a security violation depending on whose instruction led to the a...
Vincent Siu, Jingxuan He, Kyle Montgomery, Zhun Wang, Neil Gong, Chenguang Wang, Dawn Song
Listen First, Then Answer: Timestamp-Grounded Speech Reasoning
Large audio-language models (LALMs) can generate reasoning chains for their predictions, but it remains unclear whether these reasoning chains remain grounded in the input audio. In this paper, we ...
Jihoon Jeong, Pooneh Mousavi, Mirco Ravanelli, Cem Subakan
ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models
Effective collaboration begins with knowing when to ask for help. For example, when trying to identify an occluded object, a human would ask someone to remove the obstruction. Can MLLMs exhibit a s...
Thomas De Min, Subhankar Roy, Stéphane Lathuilière, Elisa Ricci, Massimiliano Mancini
Global Convergence of Multiplicative Updates for the Matrix Mechanism: A Collaborative Proof with Gemini 3
We analyze a fixed-point iteration $v \leftarrow φ(v)$ arising in the optimization of a regularized nuclear norm objective involving the Hadamard product structure, posed in~\cite{denisov} in the c...
Keith Rush
Can LLMs Prove Robotic Path Planning Optimality? A Benchmark for Research-Level Algorithm Verification
Robotic path planning problems are often NP-hard, and practical solutions typically rely on approximation algorithms with provable performance guarantees for general cases. While designing such alg...
Zhengbang Yang, Md. Tasin Tazwar, Minghan Wei, Zhuangdi Zhu
Hyperagents
Self-improving AI systems aim to reduce reliance on human engineering by learning to improve their own learning and problem-solving processes. Existing approaches to self-improvement rely on fixed,...
Jenny Zhang, Bingchen Zhao, Wannan Yang, Jakob Foerster, Jeff Clune, Minqi Jiang, Sam Devlin, Tat...
Computer-Orchestrated Design of Algorithms: From Join Specification to Implementation
Equipping query processing systems with provable theoretical guarantees has been a central focus at the intersection of database theory and systems in recent years. However, the divergence between ...
Zeyuan Hu
Is Evaluation Awareness Just Format Sensitivity? Limitations of Probe-Based Evidence under Controlled Prompt Structure
Prior work uses linear probes on benchmark prompts as evidence of evaluation awareness in large language models. Because evaluation context is typically entangled with benchmark format and genre, i...
Viliana Devbunova
Investigating In-Context Privacy Learning by Integrating User-Facing Privacy Tools into Conversational Agents
Supporting users in protecting sensitive information when using conversational agents (CAs) is crucial, as users may undervalue privacy protection due to outdated, partial, or inaccurate knowledge ...
Mohammad Hadi Nezhad, Francisco Enrique Vicente Castro, Ivon Arroyo
Ringdown modeling for effective-one-body waveforms in the test-mass limit for eccentric equatorial orbits around a Kerr black hole
We study the plunge and merger of a non-spinning particle falling into a Kerr black hole following an eccentric planar inspiral. The dynamics is driven by an effective-one-body radiation reaction, ...
Simone Albanesi, Sebastiano Bernuzzi, Alessandro Nagar
DePro: Understanding the Role of LLMs in Debugging Competitive Programming Code
Debugging consumes a substantial portion of the software development lifecycle, yet the effectiveness of Large Language Models (LLMs) in this task is not well understood. Competitive programming off...
Nabiha Parvez, Tanvin Sarkar Pallab, Mia Mohammad Imran, Tarannum Shaila Zaman
Optimizing Resource-Constrained Non-Pharmaceutical Interventions for Multi-Cluster Outbreak Control Using Hierarchical Reinforcement Learning
Non-pharmaceutical interventions (NPIs), such as diagnostic testing and quarantine, are crucial for controlling infectious disease outbreaks but are often constrained by limited resources, particul...
Xueqiao Peng, Andrew Perrault
Understanding Bell locality tests at colliders
For decades, it has been known that local hidden variable theories cannot be disproved by collider experiments involving decaying particles. However, if these theories satisfy a small set of mild a...
J. A. Aguilar-Saavedra, J. A. Casas, J. M. Moreno