Papers

Research papers from arXiv and related sources

Total: 4513 · AI/LLM: 2483 · Testing: 2030
AI LLM

An LLM-driven Scenario Generation Pipeline Using an Extended Scenic DSL for Autonomous Driving Safety Validation

Real-world crash reports, which combine textual summaries and sketches, are valuable for scenario-based testing of autonomous driving systems (ADS). However, current methods cannot effectively tran...

Fida Khandaker Safa, Yupeng Jiang, Xi Zheng

2602.20644 2026-02-24

Grounding LLMs in Scientific Discovery via Embodied Actions

Large Language Models (LLMs) have shown significant potential in scientific discovery but struggle to bridge the gap between theoretical reasoning and verifiable physical simulation. Existing solut...

Bo Zhang, Jinfeng Zhou, Yuxuan Chen, Jianing Yin, Minlie Huang, Hongning Wang

2602.20639 2026-02-24

AI Combines, Humans Socialise: A SECI-based Experience Report on Business Simulation Games

Background. Business Simulation Games (BSG) are widely used to foster experiential learning in complex managerial and organisational contexts by exposing students to decision-making under uncertain...

Nordine Benkeltoum

2602.20633 2026-02-24

QEDBENCH: Quantifying the Alignment Gap in Automated Evaluation of University-Level Mathematical Proofs

As Large Language Models (LLMs) saturate elementary benchmarks, the research frontier has shifted from generation to the reliability of automated evaluation. We demonstrate that standard "LLM-as-a-...

Santiago Gonzalez, Alireza Amiri Bavandpour, Peter Ye, Edward Zhang, Ruslans Aleksejevs, Todor An...

2602.20629 2026-02-24

When can we trust untrusted monitoring? A safety case sketch across collusion strategies

AIs are increasingly being deployed with greater autonomy and capabilities, which increases the risk that a misaligned AI may be able to cause catastrophic harm. Untrusted monitoring -- using one u...

Nelson Gardner-Challis, Jonathan Bostock, Georgiy Kozhevnikov, Morgan Sinclaire, Joan Velja, Ales...

2602.20628 2026-02-24

Physics-based phenomenological characterization of cross-modal bias in multimodal models

The term 'algorithmic fairness' is used to evaluate whether AI models operate fairly in both comparative (where fairness is understood as formal equality, such as "treat like cases as like") and no...

Hyeongmo Kim, Sohyun Kang, Yerin Choi, Seungyeon Ji, Junhyuk Woo, Hyunsuk Chung, Soyeon Caren Han...

2602.20624 2026-02-24

RecoverMark: Robust Watermarking for Localization and Recovery of Manipulated Faces

The proliferation of AI-generated content has facilitated sophisticated face manipulation, severely undermining visual integrity and posing unprecedented challenges to intellectual property. In res...

Haonan An, Xiaohui Ye, Guang Hua, Yihang Tao, Hangcheng Cao, Xiangyu Yu, Yuguang Fang

2602.20618 2026-02-24

Amortized Bayesian inference for actigraph time sheet data from mobile devices

Mobile data technologies use "actigraphs" to furnish information on health variables as a function of a subject's movement. The advent of wearable devices and related technologies has propelled t...

Daniel Zhou, Sudipto Banerjee

2602.20611 2026-02-24

SpecMind: Cognitively Inspired, Interactive Multi-Turn Framework for Postcondition Inference

Specifications are vital for ensuring program correctness, yet writing them manually remains challenging and time-intensive. Recent large language model (LLM)-based methods have shown successes in ...

Cuong Chi Le, Minh V. T Pham, Tung Vu Duy, Cuong Duc Van, Huy N. Phan, Hoang N. Phan, Tien N. Nguyen

2602.20610 2026-02-24

OptiLeak: Efficient Prompt Reconstruction via Reinforcement Learning in Multi-tenant LLM Services

Multi-tenant LLM serving frameworks widely adopt shared Key-Value caches to enhance efficiency. However, this creates side-channel vulnerabilities enabling prompt leakage attacks. Prior studies ide...

Longxiang Wang, Xiang Zheng, Xuhao Zhang, Yao Zhang, Ye Wu, Cong Wang

2602.20595 2026-02-24

Personal Information Parroting in Language Models

Modern language models (LMs) are trained on large scrapes of the Web, containing millions of personal information (PI) instances, many of which LMs memorize, increasing privacy risks. In this work, ...

Nishant Subramani, Kshitish Ghate, Mona Diab

2602.20580 2026-02-24

Efficient and Explainable End-to-End Autonomous Driving via Masked Vision-Language-Action Diffusion

Large Language Models (LLMs) and Vision-Language Models (VLMs) have emerged as promising candidates for end-to-end autonomous driving. However, these models typically face challenges in inference l...

Jiaru Zhang, Manav Gagvani, Can Cui, Juntong Peng, Ruqi Zhang, Ziran Wang

2602.20577 2026-02-24

CausalReasoningBenchmark: A Real-World Benchmark for Disentangled Evaluation of Causal Identification and Estimation

Many benchmarks for automated causal inference evaluate a system's performance based on a single numerical output, such as an Average Treatment Effect (ATE). This approach conflates two distinct st...

Ayush Sawarni, Jiyuan Tan, Vasilis Syrgkanis

2602.20571 2026-02-24

AIForge-Doc: A Benchmark for Detecting AI-Forged Tampering in Financial and Form Documents

We present AIForge-Doc, the first dedicated benchmark exclusively targeting diffusion-model-based inpainting in financial and form documents with pixel-level annotation. Existing document forgery d...

Jiaqi Wu, Yuchen Zhou, Muduo Xu, Zisheng Liang, Simiao Ren, Jiayu Xue, Meige Yang, Siying Chen, J...

2602.20569 2026-02-24

From Logs to Language: Learning Optimal Verbalization for LLM-Based Recommendation in Production

Large language models (LLMs) are promising backbones for generative recommender systems, yet a key challenge remains underexplored: verbalization, i.e., converting structured user interaction logs ...

Yucheng Shi, Ying Li, Yu Wang, Yesu Feng, Arjun Rao, Rein Houthooft, Shradha Sehgal, Jin Wang, Ha...

2602.20558 2026-02-24

CAD-Prompted SAM3: Geometry-Conditioned Instance Segmentation for Industrial Objects

Verbal-prompted segmentation is inherently limited by the expressiveness of natural language and struggles with uncommon, instance-specific, or difficult-to-describe objects: scenarios frequently e...

Zhenran Tang, Rohan Nagabhirava, Changliu Liu

2602.20551 2026-02-24

What Drives Students' Use of AI Chatbots? Technology Acceptance in Conversational AI

Conversational AI tools have been rapidly adopted by students and are becoming part of their learning routines. To understand what drives this adoption, we draw on the Technology Acceptance Model (...

Griffin Pitts, Sanaz Motamedi

2602.20547 2026-02-24

Generative AI and Machine Learning Collaboration for Container Dwell Time Prediction via Data Standardization

Import container dwell time (ICDT) prediction is a key task for improving productivity in container terminals, as accurate predictions enable the reduction of container re-handling operations by ya...

Minseop Kim, Takhyeong Kim, Taekhyun Park, Hanbyeol Park, Hyerim Bae

2602.20540 2026-02-24

Actor-Curator: Co-adaptive Curriculum Learning via Policy-Improvement Bandits for RL Post-Training

Post-training large foundation models with reinforcement learning typically relies on massive and heterogeneous datasets, making effective curriculum learning both critical and challenging. In this...

Zhengyao Gu, Jonathan Light, Raul Astudillo, Ziyu Ye, Langzhou He, Henry Peng Zou, Wei Cheng, San...

2602.20532 2026-02-24

Stop-Think-AutoRegress: Language Modeling with Latent Diffusion Planning

The Stop-Think-AutoRegress Language Diffusion Model (STAR-LDM) integrates latent diffusion planning with autoregressive generation. Unlike conventional autoregressive language models limited to tok...

Justin Lovelace, Christian Belardi, Sofian Zalouk, Adhitya Polavaram, Srivatsa Kundurthy, Kilian ...

2602.20528 2026-02-24