Papers
Research papers from arXiv and related sources
Anticipate, Adapt, Act: A Hybrid Framework for Task Planning
Anticipating and adapting to failures is a key capability robots need to collaborate effectively with humans in complex domains. This continues to be a challenge despite the impressive performance ...
Nabanita Dash, Ayush Kaura, Shivam Singh, Ramandeep Singh, Snehasis Banerjee, Mohan Sridharan, K....
Classroom Final Exam: An Instructor-Tested Reasoning Benchmark
We introduce CFE (Classroom Final Exam), a multimodal benchmark for evaluating the reasoning capabilities of large language models across more than 20 STEM domains. \C...
Chongyang Gao, Diji Yang, Shuyan Zhou, Xichen Yan, Luchuan Song, Shuo Li, Kezhen Chen
Pixel2Phys: Distilling Governing Laws from Visual Dynamics
Discovering physical laws directly from high-dimensional visual data is a long-standing human pursuit but remains a formidable challenge for machines, representing a fundamental goal of scientific ...
Ruikun Li, Jun Yao, Yingfan Hua, Shixiang Tang, Biqing Qi, Bin Liu, Wanli Ouyang, Yan Lu
Security Risks of AI Agents Hiring Humans: An Empirical Marketplace Study
Autonomous AI agents can now programmatically hire human workers through marketplaces using REST APIs and Model Context Protocol (MCP) integrations. This creates an attack surface analogous to CAPT...
Pulak Mehta
Real-time Win Probability and Latent Player Ability via STATS X in Team Sports
This study proposes a statistically grounded framework for real-time win probability evaluation and player assessment in score-based team sports, based on minute-by-minute cumulative box-score data...
Yasutaka Shimizu, Atsushi Yamanobe
Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference
Large Language Models (LLMs) face a persistent trade-off between inference cost and reasoning capability. While "Oracle" models (e.g., Llama-3-70B) achieve state-of-the-art accuracy, they are prohi...
Arindam Khaled
Conversational AI for Automated Patient Questionnaire Completion: Development Insights and Design Principles
Collecting patient-reported outcome measures (PROMs) is essential for clinical care and research, yet traditional form-based approaches are often tedious for patients and burdensome for clinicians....
David Fraile Navarro, Mor Peleg
Test-Time Computing for Referring Multimodal Large Language Models
We propose ControlMLLM++, a novel test-time adaptation framework that injects learnable visual prompts into frozen multimodal large language models (MLLMs) to enable fine-grained region-based visua...
Mingrui Wu, Hao Chen, Jiayi Ji, Xiaoshuai Sun, Zhiyuan Liu, Liujuan Cao, Ming-Ming Cheng, Rongron...
MICON-Bench: Benchmarking and Enhancing Multi-Image Context Image Generation in Unified Multimodal Models
Recent advancements in Unified Multimodal Models (UMMs) have enabled remarkable image understanding and generation capabilities. However, while models like Gemini-2.5-Flash-Image show emerging abil...
Mingrui Wu, Hang Liu, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji
FuzzySQL: Uncovering Hidden Vulnerabilities in DBMS Special Features with LLM-Driven Fuzzing
Traditional database fuzzing techniques primarily focus on syntactic correctness and general SQL structures, leaving critical yet obscure DBMS features, such as system-level modes (e.g., GTID), pro...
Yongxin Chen, Zhiyuan Jiang, Chao Zhang, Haoran Xu, Shenglin Xu, Jianping Tang, Zheming Li, Peida...
Kaon decay constraints on vector bosons coupled to non-conserved currents
We study rare three- and four-body kaon decays as a probe of light vector and axial-vector bosons coupled to non-conserved currents. We find that searches for $K_L \to \pi^0 \pi^0 (X\to e^+e^-)$ decays...
Matheus Hostert, Maxim Pospelov, Adrian Thompson
Physics-Aware, Shannon-Optimal Compression via Arithmetic Coding for Distributional Fidelity
Assessing whether two datasets are distributionally consistent has become a central theme in modern scientific analysis, particularly as generative artificial intelligence is increasingly used to p...
Cristiano Fanelli
Zero Variance Portfolio
When the number of assets is larger than the sample size, the minimum variance portfolio interpolates the training data, delivering pathological zero in-sample variance. We show that if the weights...
Jinyuan Chang, Yi Ding, Zhentao Shi, Bo Zhang
Optimal Error Estimates of a new Multiphysic Finite Element Method for Nonlinear Poroelasticity model with Hencky-Mises Stress Tensor
In this paper, we develop a new multiphysics finite element method for a nonlinear poroelastic model with Hencky-Mises stress tensor. By introducing some new notations, we reformulate the original ...
Yanan He, Zhihao Ge
HD-TTA: Hypothesis-Driven Test-Time Adaptation for Safer Brain Tumor Segmentation
Standard Test-Time Adaptation (TTA) methods typically treat inference as a blind optimization task, applying generic objectives to all or filtered test samples. In safety-critical medical segmentat...
Kartik Jhawar, Lipo Wang
Red-Teaming Claude Opus and ChatGPT-based Security Advisors for Trusted Execution Environments
Trusted Execution Environments (TEEs) (e.g., Intel SGX and Arm TrustZone) aim to protect sensitive computation from a compromised operating system, yet real deployments remain vulnerable to microarc...
Kunal Mukherjee
OptiRepair: Closed-Loop Diagnosis and Repair of Supply Chain Optimization Models with LLM Agents
Problem Definition. Supply chain optimization models frequently become infeasible because of modeling errors. Diagnosis and repair require scarce OR expertise: analysts must interpret solver diagno...
Ruicheng Ao, David Simchi-Levi, Xinshang Wang
A unified SPH framework for shell-related interactions
A unified Smoothed Particle Hydrodynamics (SPH) framework is proposed to simulate interaction dynamics involving thin shells modeled by a reduced-dimensional, single-layer particle discretization, ...
Dong Wu, Shuaihao Zhang, Weiyi Kong, Xiangyu Hu
How Robust are Robustness Checks?
Robustness checks are routine in empirical work, but there is no standard statistical procedure to formally measure what one can learn from them. I propose a "robustness radius" measure to quantify...
Brenda Prallon
On the Variability of Source Code in Maven Package Rebuilds
Rebuilding packages from open source is a common practice to improve the security of software supply chains, and is now done at an industrial scale. The basic principle is to acquire the source cod...
Jens Dietrich, Behnaz Hassanshahi