Research

Papers

Research papers from arXiv and related sources

Total: 4694, AI/LLM: 2583, Testing: 2111
AI LLM

FireBench: Evaluating Instruction Following in Enterprise and API-Driven LLM Applications

Instruction following is critical for LLMs deployed in enterprise and API-driven settings, where strict adherence to output formats, content constraints, and procedural requirements is essential fo...

Yunfan Zhang, Yijie Bei, Jetashree Ravi, Pawel Garbacki

2603.04857 2026-03-05
AI LLM

HACHIMI: Scalable and Controllable Student Persona Generation via Orchestrated Agents

Student Personas (SPs) are emerging as infrastructure for educational LLMs, yet prior work often relies on ad-hoc prompting or hand-crafted profiles with limited control over educational theory and...

Yilin Jiang, Fei Tan, Xuanyu Yin, Jing Leng, Aimin Zhou

2603.04855 2026-03-05
AI LLM

SinhaLegal: A Benchmark Corpus for Information Extraction and Analysis in Sinhala Legislative Texts

SinhaLegal introduces a Sinhala legislative text corpus containing approximately 2 million words across 1,206 legal documents. The dataset includes two types of legal documents: 1,065 Acts dated fr...

Minduli Lasandi, Nevidu Jayatilleke

2603.04854 2026-03-05
AI LLM

On Multi-Step Theorem Prediction via Non-Parametric Structural Priors

Multi-step theorem prediction is a central challenge in automated reasoning. Existing neural-symbolic approaches rely heavily on supervised parametric models, which exhibit limited generalization t...

Junbo Zhao, Ting Zhang, Can Li, Wei He, Jingdong Wang, Hua Huang

2603.04852 2026-03-05
AI LLM

Why Is RLHF Alignment Shallow? A Gradient Analysis

Why is safety alignment in LLMs shallow? We prove that gradient-based alignment inherently concentrates on positions where harm is decided and vanishes beyond. Using a martingale decomposition of s...

Robin Young

2603.04851 2026-03-05
AI LLM

Design Behaviour Codes (DBCs): A Taxonomy-Driven Layered Governance Benchmark for Large Language Models

We introduce the Dynamic Behavioral Constraint (DBC) benchmark, the first empirical framework for evaluating the efficacy of a structured, 150-control behavioral governance layer, the MDBC (Madan D...

G. Madan Mohan, Veena Kiran Nambiar, Kiranmayee Janardhan

2603.04837 2026-03-05
AI LLM

From Unfamiliar to Familiar: Detecting Pre-training Data via Gradient Deviations in Large Language Models

Pre-training data detection for LLMs is essential for addressing copyright concerns and mitigating benchmark contamination. Existing methods mainly focus on the likelihood-based statistical feature...

Ruiqi Zhang, Lingxiang Wang, Hainan Zhang, Zhiming Zheng, Yanyan Lan

2603.04828 2026-03-05
AI LLM

VISA: Value Injection via Shielded Adaptation for Personalized LLM Alignment

Aligning Large Language Models (LLMs) with nuanced human values remains a critical challenge, as existing methods like Reinforcement Learning from Human Feedback (RLHF) often handle only coarse-gra...

Jiawei Chen, Tianzhuo Yang, Guoxi Zhang, Jiaming Ji, Yaodong Yang, Juntao Dai

2603.04822 2026-03-05
AI LLM

Autoscoring Anticlimax: A Meta-analytic Understanding of AI's Short-answer Shortcomings and Wording Weaknesses

Automated short-answer scoring lags other LLM applications. We meta-analyze 890 culminating results across a systematic review of LLM short-answer scoring studies, modeling the traditional effect s...

Michael Hardy

2603.04820 2026-03-05
TESTING

LLM-Grounded Explainability for Port Congestion Prediction via Temporal Graph Attention Networks

Port congestion at major maritime hubs disrupts global supply chains, yet existing prediction systems typically prioritize forecasting accuracy without providing operationally interpretable explana...

Zhiming Xue, Yujue Wang

2603.04818 2026-03-05
AI LLM

EchoGuard: An Agentic Framework with Knowledge-Graph Memory for Detecting Manipulative Communication in Longitudinal Dialogue

Manipulative communication, such as gaslighting, guilt-tripping, and emotional coercion, is often difficult for individuals to recognize. Existing agentic AI systems lack the structured, longitudin...

Ratna Kandala, Niva Manchanda, Akshata Kishore Moharir, Ananth Kandala

2603.04815 2026-03-05
AI LLM

Beyond the Context Window: A Cost-Performance Analysis of Fact-Based Memory vs. Long-Context LLMs for Persistent Agents

Persistent conversational AI systems face a choice between passing full conversation histories to a long-context large language model (LLM) and maintaining a dedicated memory system that extracts a...

Natchanon Pollertlam, Witchayut Kornsuwannawit

2603.04814 2026-03-05
TESTING

Detection of GNSS Interference Using Reflected Signal Observations from the LEO Satellite Constellation

Radio Frequency Interference (RFI) is a growing concern for Global Navigation Satellite System (GNSS) reliability. The Cyclone GNSS (CYGNSS) constellation, designed for ocean wind retrieval via GNS...

Ji-Hyeon Shin, Pyo-Woong Son

2603.04813 2026-03-05
AI LLM

SparkTales: Facilitating Cross-Language Collaborative Storytelling through Coordinator-AI Collaboration

Cross-language collaborative storytelling plays a vital role in children's language learning and cultural development, fostering both expressive ability and intercultural awareness. Yet, in practic...

Wenxin Zhao, Peng Zhang, Hansu Gu, Haoxuan Zhou, Xiaojie Huo, Lin Wang, Wen Zheng, Tun Lu, Ning Gu

2603.04806 2026-03-05
TESTING

Can LLMs Synthesize Court-Ready Statistical Evidence? Evaluating AI-Assisted Sentencing Bias Analysis for California Racial Justice Act Claims

Resentencing in California remains a complex legal challenge despite legislative reforms like the Racial Justice Act (2020), which allows defendants to challenge convictions based on statistical ev...

Aparna Komarla

2603.04804 2026-03-05
AI LLM

MASQuant: Modality-Aware Smoothing Quantization for Multimodal Large Language Models

Post-training quantization (PTQ) with computational invariance for Large Language Models (LLMs) have demonstrated remarkable advances, however, their application to Multimodal Large Language Models...

Lulu Hu, Wenhu Xiao, Xin Chen, Xinhua Xu, Bowen Xu, Kun Li, Yongliang Tao

2603.04800 2026-03-05
AI LLM

Beyond Linear LLM Invocation: An Efficient and Effective Semantic Filter Paradigm

Large language models (LLMs) are increasingly used for semantic query processing over large corpora. A set of semantic operators derived from relational algebra has been proposed to provide a unifi...

Nan Hou, Kangfei Zhao, Jiadong Xie, Jeffrey Xu Yu

2603.04799 2026-03-05
AI LLM

Hardware-Software Co-design for 3D-DRAM-based LLM Serving Accelerator

Large language models (LLMs) have been widely deployed for online generative services, where numerous LLM instances jointly handle workloads with fluctuating request arrival rates and variable requ...

Cong Li, Yihan Yin, Chenhao Xue, Zhao Wang, Fujun Bai, Yixin Guo, Xiping Jiang, Qiang Wu, Yuan Xi...

2603.04797 2026-03-05
TESTING

Could the interaction of jet and SN ejecta be the cause of X-ray knots observed in a radio galaxy?

We investigate the interaction between relativistic jets and supernova (SN) ejecta as a potential origin of X-ray knots in radio galaxies, employing knot A in M 87 as a test case. By modeling the d...

Jia-Chun He, Xiao-Na Sun, Hao-Qiang Zhang, Yun-Feng Liang, Hai-Ming Zhang, Da-Bin Lin, En-Wei Liang

2603.04781 2026-03-05
TESTING

MOOSEnger -- a Domain-Specific AI Agent for the MOOSE Ecosystem

MOOSEnger is a tool-enabled AI agent tailored to the Multiphysics Object-Oriented Simulation Environment (MOOSE). MOOSE cases are specified in HIT ".i" input files; the large object catalog and str...

Mengnan Li, Jason Miller, Zachary Prince, Alexander Lindsay, Cody Permann

2603.04756 2026-03-05