Papers

Research papers from arXiv and related sources

Total: 4513 | AI/LLM: 2483 | Testing: 2030
AI LLM

Training Generalizable Collaborative Agents via Strategic Risk Aversion

Many emerging agentic paradigms require agents to collaborate with one another (or people) to achieve shared goals. Unfortunately, existing approaches to learning policies for such collaborative pr...

Chengrui Qu, Yizhou Zhang, Nicholas Lanzetti, Eric Mazumdar

2602.21515 2026-02-25
AI LLM

Beyond Refusal: Probing the Limits of Agentic Self-Correction for Semantic Sensitive Information

While defenses for structured PII are mature, Large Language Models (LLMs) pose a new threat: Semantic Sensitive Information (SemSI), where models infer sensitive identity attributes, generate repu...

Umid Suleymanov, Zaur Rajabov, Emil Mirzazada, Murat Kantarcioglu

2602.21496 2026-02-25
AI LLM

GradAlign: Gradient-Aligned Data Selection for LLM Reinforcement Learning

Reinforcement learning (RL) has become a central post-training paradigm for large language models (LLMs), but its performance is highly sensitive to the quality of training problems. This sensitivi...

Ningyuan Yang, Weihua Du, Weiwei Sun, Sean Welleck, Yiming Yang

2602.21492 2026-02-25
AI LLM

StoryComposerAI: Supporting Human-AI Story Co-Creation Through Decomposition and Linking

GenAI's ability to produce text and images is increasingly incorporated into human-AI co-creation tasks such as storytelling and video editing. However, integrating GenAI into these tasks requires ...

Shuo Niu, Dylan Clements, Marina Margalit Nemanov, Hyungsin Kim

2602.21486 2026-02-25
AI LLM

Aletheia tackles FirstProof autonomously

We report the performance of Aletheia (Feng et al., 2026b), a mathematics research agent powered by Gemini 3 Deep Think, on the inaugural FirstProof challenge. Within the allowed timeframe of the c...

Tony Feng, Junehyuk Jung, Sang-hyun Kim, Carlo Pagano, Sergei Gukov, Chiang-Chiang Tsai, David Wo...

2602.21201 2026-02-24
AI LLM

Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs

Embodied LLMs endow robots with high-level task reasoning, but they cannot reflect on what went wrong or why, turning deployment into a sequence of independent trials where mistakes repeat rather t...

Yining Hong, Huang Huang, Manling Li, Li Fei-Fei, Jiajun Wu, Yejin Choi

2602.21198 2026-02-24
AI LLM

On Data Engineering for Scaling LLM Terminal Capabilities

Despite rapid recent progress in the terminal capabilities of large language models, the training data strategies behind state-of-the-art terminal agents remain largely undisclosed. We address this...

Renjie Pi, Grace Lam, Mohammad Shoeybi, Pooya Jannaty, Bryan Catanzaro, Wei Ping

2602.21193 2026-02-24
AI LLM

Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training

Pass@k is a widely used performance metric for verifiable large language model tasks, including mathematical reasoning, code generation, and short-answer reasoning. It defines success if any of $k$...

Anas Barakat, Souradip Chakraborty, Khushbu Pahwa, Amrit Singh Bedi

2602.21189 2026-02-24
AI LLM

XMorph: Explainable Brain Tumor Analysis Via LLM-Assisted Hybrid Deep Intelligence

Deep learning has significantly advanced automated brain tumor diagnosis, yet clinical adoption remains limited by interpretability and computational constraints. Conventional models often act as o...

Sepehr Salem Ghahfarokhi, M. Moein Esfahani, Raj Sunderraman, Vince Calhoun, Mohammed Alser

2602.21178 2026-02-24
AI LLM

NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning

Vision-Language-Action (VLA) models are advancing autonomous driving by replacing modular pipelines with unified end-to-end architectures. However, current VLAs face two expensive requirements: (1)...

Ishaan Rawal, Shubh Gupta, Yihan Hu, Wei Zhan

2602.21172 2026-02-24
AI LLM

ActionReasoning: Robot Action Reasoning in 3D Space with LLM for Robotic Brick Stacking

Classical robotic systems typically rely on custom planners designed for constrained environments. While effective in restricted settings, these systems lack generalization capabilities, limiting t...

Guangming Wang, Qizhen Ying, Yixiong Jing, Olaf Wysocki, Brian Sheil

2602.21161 2026-02-24
AI LLM

SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards

Large language models (LLMs) are increasingly deployed as multi-step decision-making agents, where effective reward design is essential for guiding learning. Although recent work explores various f...

Dengjia Zhang, Xiaoou Liu, Lu Cheng, Yaqing Wang, Kenton Murray, Hua Wei

2602.21158 2026-02-24
AI LLM

Scaling State-Space Models on Multiple GPUs with Tensor Parallelism

Selective state space models (SSMs) have rapidly become a compelling backbone for large language models, especially for long-context workloads. Yet in deployment, their inference performance is oft...

Anurag Dutt, Nimit Shah, Hazem Masarani, Anshul Gandhi

2602.21144 2026-02-24
AI LLM

A Benchmark for Deep Information Synthesis

Large language model (LLM)-based agents are increasingly used to solve complex tasks involving tool use, such as web browsing, code execution, and data analysis. However, current evaluation benchma...

Debjit Paul, Daniel Murphy, Milan Gritta, Ronald Cardenas, Victor Prokhorov, Lena Sophia Bolliger...

2602.21143 2026-02-24
AI LLM

ReviveMoE: Fast Recovery for Hardware Failures in Large-Scale MoE LLM Inference Deployments

As LLM deployments scale over more hardware, the probability of a single failure in a system increases significantly, and cloud operators must consider robust countermeasures to handle these inevit...

Haley Li, Xinglu Wang, Cong Feng, Chunxu Zuo, Yanan Wang, Hei Lo, Yufei Cui, Bingji Wang, Duo Cui...

2602.21140 2026-02-24
AI LLM

SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery

Qualitative insights from user experiences are critical for informing product and policy decisions, but collecting such data at scale is constrained by the time and availability of experts to condu...

David Anugraha, Vishakh Padmakumar, Diyi Yang

2602.21136 2026-02-24
AI LLM

"Are You Sure?": An Empirical Study of Human Perception Vulnerability in LLM-Driven Agentic Systems

Large language model (LLM) agents are rapidly becoming trusted copilots in high-stakes domains like software development and healthcare. However, this deepening trust introduces a novel attack surf...

Xinfeng Li, Shenyu Dai, Kelong Zheng, Yue Xiao, Gelei Deng, Wei Dong, Xiaofeng Wang

2602.21127 2026-02-24
AI LLM

Prompt-Level Distillation: A Non-Parametric Alternative to Model Fine-Tuning for Efficient Reasoning

Advanced reasoning typically requires Chain-of-Thought prompting, which is accurate but incurs prohibitive latency and substantial test-time inference costs. The standard alternative, fine-tuning s...

Sanket Badhe, Deep Shah

2602.21103 2026-02-24
AI LLM

Turning Semantics into Topology: LLM-Driven Attribute Augmentation for Collaborative Filtering

Large Language Models (LLMs) have shown great potential for enhancing recommender systems through their extensive world knowledge and reasoning capabilities. However, effectively translating these ...

Junjie Meng, Ranxu Zhang, Wei Wu, Rui Zhang, Chuan Qin, Qi Zhang, Qi Liu, Hui Xiong, Chao Wang

2602.21099 2026-02-24
AI LLM

Can Interest-Bearing Positions Solve the Long-Horizon Problem in Prediction Markets?

Prediction markets suffer from reduced liquidity and price accuracy for long-horizon events due to the opportunity cost of committed capital. Recently, major platforms have introduced interest-bear...

Caleb Maresca

2602.21091 2026-02-24