Research

Papers

Research papers from arXiv and related sources

Total: 4694 | AI/LLM: 2583 | Testing: 2111
AI LLM

FireBench: Evaluating Instruction Following in Enterprise and API-Driven LLM Applications

Instruction following is critical for LLMs deployed in enterprise and API-driven settings, where strict adherence to output formats, content constraints, and procedural requirements is essential fo...

Yunfan Zhang, Yijie Bei, Jetashree Ravi, Pawel Garbacki

2603.04857 2026-03-05
AI LLM

HACHIMI: Scalable and Controllable Student Persona Generation via Orchestrated Agents

Student Personas (SPs) are emerging as infrastructure for educational LLMs, yet prior work often relies on ad-hoc prompting or hand-crafted profiles with limited control over educational theory and...

Yilin Jiang, Fei Tan, Xuanyu Yin, Jing Leng, Aimin Zhou

2603.04855 2026-03-05
AI LLM

SinhaLegal: A Benchmark Corpus for Information Extraction and Analysis in Sinhala Legislative Texts

SinhaLegal introduces a Sinhala legislative text corpus containing approximately 2 million words across 1,206 legal documents. The dataset includes two types of legal documents: 1,065 Acts dated fr...

Minduli Lasandi, Nevidu Jayatilleke

2603.04854 2026-03-05
AI LLM

On Multi-Step Theorem Prediction via Non-Parametric Structural Priors

Multi-step theorem prediction is a central challenge in automated reasoning. Existing neural-symbolic approaches rely heavily on supervised parametric models, which exhibit limited generalization t...

Junbo Zhao, Ting Zhang, Can Li, Wei He, Jingdong Wang, Hua Huang

2603.04852 2026-03-05
AI LLM

Why Is RLHF Alignment Shallow? A Gradient Analysis

Why is safety alignment in LLMs shallow? We prove that gradient-based alignment inherently concentrates on positions where harm is decided and vanishes beyond. Using a martingale decomposition of s...

Robin Young

2603.04851 2026-03-05
AI LLM

Design Behaviour Codes (DBCs): A Taxonomy-Driven Layered Governance Benchmark for Large Language Models

We introduce the Dynamic Behavioral Constraint (DBC) benchmark, the first empirical framework for evaluating the efficacy of a structured, 150-control behavioral governance layer, the MDBC (Madan D...

G. Madan Mohan, Veena Kiran Nambiar, Kiranmayee Janardhan

2603.04837 2026-03-05
AI LLM

From Unfamiliar to Familiar: Detecting Pre-training Data via Gradient Deviations in Large Language Models

Pre-training data detection for LLMs is essential for addressing copyright concerns and mitigating benchmark contamination. Existing methods mainly focus on the likelihood-based statistical feature...

Ruiqi Zhang, Lingxiang Wang, Hainan Zhang, Zhiming Zheng, Yanyan Lan

2603.04828 2026-03-05
AI LLM

VISA: Value Injection via Shielded Adaptation for Personalized LLM Alignment

Aligning Large Language Models (LLMs) with nuanced human values remains a critical challenge, as existing methods like Reinforcement Learning from Human Feedback (RLHF) often handle only coarse-gra...

Jiawei Chen, Tianzhuo Yang, Guoxi Zhang, Jiaming Ji, Yaodong Yang, Juntao Dai

2603.04822 2026-03-05
AI LLM

Autoscoring Anticlimax: A Meta-analytic Understanding of AI's Short-answer Shortcomings and Wording Weaknesses

Automated short-answer scoring lags other LLM applications. We meta-analyze 890 culminating results across a systematic review of LLM short-answer scoring studies, modeling the traditional effect s...

Michael Hardy

2603.04820 2026-03-05
AI LLM

EchoGuard: An Agentic Framework with Knowledge-Graph Memory for Detecting Manipulative Communication in Longitudinal Dialogue

Manipulative communication, such as gaslighting, guilt-tripping, and emotional coercion, is often difficult for individuals to recognize. Existing agentic AI systems lack the structured, longitudin...

Ratna Kandala, Niva Manchanda, Akshata Kishore Moharir, Ananth Kandala

2603.04815 2026-03-05
AI LLM

Beyond the Context Window: A Cost-Performance Analysis of Fact-Based Memory vs. Long-Context LLMs for Persistent Agents

Persistent conversational AI systems face a choice between passing full conversation histories to a long-context large language model (LLM) and maintaining a dedicated memory system that extracts a...

Natchanon Pollertlam, Witchayut Kornsuwannawit

2603.04814 2026-03-05
AI LLM

SparkTales: Facilitating Cross-Language Collaborative Storytelling through Coordinator-AI Collaboration

Cross-language collaborative storytelling plays a vital role in children's language learning and cultural development, fostering both expressive ability and intercultural awareness. Yet, in practic...

Wenxin Zhao, Peng Zhang, Hansu Gu, Haoxuan Zhou, Xiaojie Huo, Lin Wang, Wen Zheng, Tun Lu, Ning Gu

2603.04806 2026-03-05
AI LLM

MASQuant: Modality-Aware Smoothing Quantization for Multimodal Large Language Models

Post-training quantization (PTQ) with computational invariance for Large Language Models (LLMs) has demonstrated remarkable advances; however, its application to Multimodal Large Language Models...

Lulu Hu, Wenhu Xiao, Xin Chen, Xinhua Xu, Bowen Xu, Kun Li, Yongliang Tao

2603.04800 2026-03-05
AI LLM

Beyond Linear LLM Invocation: An Efficient and Effective Semantic Filter Paradigm

Large language models (LLMs) are increasingly used for semantic query processing over large corpora. A set of semantic operators derived from relational algebra has been proposed to provide a unifi...

Nan Hou, Kangfei Zhao, Jiadong Xie, Jeffrey Xu Yu

2603.04799 2026-03-05
AI LLM

Hardware-Software Co-design for 3D-DRAM-based LLM Serving Accelerator

Large language models (LLMs) have been widely deployed for online generative services, where numerous LLM instances jointly handle workloads with fluctuating request arrival rates and variable requ...

Cong Li, Yihan Yin, Chenhao Xue, Zhao Wang, Fujun Bai, Yixin Guo, Xiping Jiang, Qiang Wu, Yuan Xi...

2603.04797 2026-03-05
AI LLM

SELDON: Supernova Explosions Learned by Deep ODE Networks

The discovery rate of optical transients will explode to 10 million public alerts per night once the Vera C. Rubin Observatory's Legacy Survey of Space and Time comes online, overwhelming the tradi...

Jiezhong Wu, Jack O'Brien, Jennifer Li, M. S. Krafczyk, Ved G. Shah, Amanda R. Wasserman, Daniel ...

2603.04392 2026-03-04
AI LLM

A Dual-Helix Governance Approach Towards Reliable Agentic AI for WebGIS Development

WebGIS development requires rigor, yet agentic AI frequently fails due to five large language model (LLM) limitations: context constraints, cross-session forgetting, stochasticity, instruction fail...

Boyuan Guan, Wencong Cui, Levente Juhasz

2603.04390 2026-03-04
AI LLM

Robustness of Agentic AI Systems via Adversarially-Aligned Jacobian Regularization

As Large Language Models (LLMs) transition into autonomous multi-agent ecosystems, robust minimax training becomes essential yet remains prone to instability when highly non-linear policies induce ...

Furkan Mumcu, Yasin Yilmaz

2603.04378 2026-03-04
AI LLM

LLM-supported 3D Modeling Tool for Radio Radiance Field Reconstruction

Accurate channel estimation is essential for massive multiple-input multiple-output (MIMO) technologies in next-generation wireless communications. Recently, the radio radiance field (RRF) has emer...

Chengling Xu, Huiwen Zhang, Haijian Sun, Feng Ye

2603.04368 2026-03-04
AI LLM

Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks

Multimodal web agents that process both screenshots and accessibility trees are increasingly deployed to interact with web interfaces, yet their dual-stream architecture opens an underexplored atta...

Haoyu Liu, Dingcheng Li, Lukas Rutishauser, Zeyu Zheng

2603.04364 2026-03-04