Papers
Research papers from arXiv and related sources
Training Generalizable Collaborative Agents via Strategic Risk Aversion
Many emerging agentic paradigms require agents to collaborate with one another (or people) to achieve shared goals. Unfortunately, existing approaches to learning policies for such collaborative pr...
Chengrui Qu, Yizhou Zhang, Nicholas Lanzetti, Eric Mazumdar
Beyond Refusal: Probing the Limits of Agentic Self-Correction for Semantic Sensitive Information
While defenses for structured PII are mature, Large Language Models (LLMs) pose a new threat: Semantic Sensitive Information (SemSI), where models infer sensitive identity attributes, generate repu...
Umid Suleymanov, Zaur Rajabov, Emil Mirzazada, Murat Kantarcioglu
GradAlign: Gradient-Aligned Data Selection for LLM Reinforcement Learning
Reinforcement learning (RL) has become a central post-training paradigm for large language models (LLMs), but its performance is highly sensitive to the quality of training problems. This sensitivi...
Ningyuan Yang, Weihua Du, Weiwei Sun, Sean Welleck, Yiming Yang
StoryComposerAI: Supporting Human-AI Story Co-Creation Through Decomposition and Linking
GenAI's ability to produce text and images is increasingly incorporated into human-AI co-creation tasks such as storytelling and video editing. However, integrating GenAI into these tasks requires ...
Shuo Niu, Dylan Clements, Marina Margalit Nemanov, Hyungsin Kim
Aletheia tackles FirstProof autonomously
We report the performance of Aletheia (Feng et al., 2026b), a mathematics research agent powered by Gemini 3 Deep Think, on the inaugural FirstProof challenge. Within the allowed timeframe of the c...
Tony Feng, Junehyuk Jung, Sang-hyun Kim, Carlo Pagano, Sergei Gukov, Chiang-Chiang Tsai, David Wo...
Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs
Embodied LLMs endow robots with high-level task reasoning, but they cannot reflect on what went wrong or why, turning deployment into a sequence of independent trials where mistakes repeat rather t...
Yining Hong, Huang Huang, Manling Li, Li Fei-Fei, Jiajun Wu, Yejin Choi
On Data Engineering for Scaling LLM Terminal Capabilities
Despite rapid recent progress in the terminal capabilities of large language models, the training data strategies behind state-of-the-art terminal agents remain largely undisclosed. We address this...
Renjie Pi, Grace Lam, Mohammad Shoeybi, Pooya Jannaty, Bryan Catanzaro, Wei Ping
Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training
Pass@k is a widely used performance metric for verifiable large language model tasks, including mathematical reasoning, code generation, and short-answer reasoning. It defines success if any of $k$...
Anas Barakat, Souradip Chakraborty, Khushbu Pahwa, Amrit Singh Bedi
XMorph: Explainable Brain Tumor Analysis Via LLM-Assisted Hybrid Deep Intelligence
Deep learning has significantly advanced automated brain tumor diagnosis, yet clinical adoption remains limited by interpretability and computational constraints. Conventional models often act as o...
Sepehr Salem Ghahfarokhi, M. Moein Esfahani, Raj Sunderraman, Vince Calhoun, Mohammed Alser
NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning
Vision-Language-Action (VLA) models are advancing autonomous driving by replacing modular pipelines with unified end-to-end architectures. However, current VLAs face two expensive requirements: (1)...
Ishaan Rawal, Shubh Gupta, Yihan Hu, Wei Zhan
ActionReasoning: Robot Action Reasoning in 3D Space with LLM for Robotic Brick Stacking
Classical robotic systems typically rely on custom planners designed for constrained environments. While effective in restricted settings, these systems lack generalization capabilities, limiting t...
Guangming Wang, Qizhen Ying, Yixiong Jing, Olaf Wysocki, Brian Sheil
SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards
Large language models (LLMs) are increasingly deployed as multi-step decision-making agents, where effective reward design is essential for guiding learning. Although recent work explores various f...
Dengjia Zhang, Xiaoou Liu, Lu Cheng, Yaqing Wang, Kenton Murray, Hua Wei
Scaling State-Space Models on Multiple GPUs with Tensor Parallelism
Selective state space models (SSMs) have rapidly become a compelling backbone for large language models, especially for long-context workloads. Yet in deployment, their inference performance is oft...
Anurag Dutt, Nimit Shah, Hazem Masarani, Anshul Gandhi
A Benchmark for Deep Information Synthesis
Large language model (LLM)-based agents are increasingly used to solve complex tasks involving tool use, such as web browsing, code execution, and data analysis. However, current evaluation benchma...
Debjit Paul, Daniel Murphy, Milan Gritta, Ronald Cardenas, Victor Prokhorov, Lena Sophia Bolliger...
ReviveMoE: Fast Recovery for Hardware Failures in Large-Scale MoE LLM Inference Deployments
As LLM deployments scale over more hardware, the probability of a single failure in a system increases significantly, and cloud operators must consider robust countermeasures to handle these inevit...
Haley Li, Xinglu Wang, Cong Feng, Chunxu Zuo, Yanan Wang, Hei Lo, Yufei Cui, Bingji Wang, Duo Cui...
SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery
Qualitative insights from user experiences are critical for informing product and policy decisions, but collecting such data at scale is constrained by the time and availability of experts to condu...
David Anugraha, Vishakh Padmakumar, Diyi Yang
"Are You Sure?": An Empirical Study of Human Perception Vulnerability in LLM-Driven Agentic Systems
Large language model (LLM) agents are rapidly becoming trusted copilots in high-stakes domains like software development and healthcare. However, this deepening trust introduces a novel attack surf...
Xinfeng Li, Shenyu Dai, Kelong Zheng, Yue Xiao, Gelei Deng, Wei Dong, Xiaofeng Wang
Prompt-Level Distillation: A Non-Parametric Alternative to Model Fine-Tuning for Efficient Reasoning
Advanced reasoning typically requires Chain-of-Thought prompting, which is accurate but incurs prohibitive latency and substantial test-time inference costs. The standard alternative, fine-tuning s...
Sanket Badhe, Deep Shah
Turning Semantics into Topology: LLM-Driven Attribute Augmentation for Collaborative Filtering
Large Language Models (LLMs) have shown great potential for enhancing recommender systems through their extensive world knowledge and reasoning capabilities. However, effectively translating these ...
Junjie Meng, Ranxu zhang, Wei Wu, Rui Zhang, Chuan Qin, Qi Zhang, Qi Liu, Hui Xiong, Chao Wang
Can Interest-Bearing Positions Solve the Long-Horizon Problem in Prediction Markets?
Prediction markets suffer from reduced liquidity and price accuracy for long-horizon events due to the opportunity cost of committed capital. Recently, major platforms have introduced interest-bear...
Caleb Maresca