Research

Papers

Research papers from arXiv and related sources

Total: 4694 | AI/LLM: 2583 | Testing: 2111
AI LLM

Invisible failures in human-AI interactions

AI systems fail silently far more often than they fail visibly. In a large-scale quantitative analysis of human-AI interactions from the WildChat dataset, we find that 78% of AI failures are invisi...

Christopher Potts, Moritz Sudhof

2603.15423 2026-03-16
AI LLM

Amplification Effects in Test-Time Reinforcement Learning: Safety and Reasoning Vulnerabilities

Test-time training (TTT) has recently emerged as a promising method to improve the reasoning abilities of large language models (LLMs), in which the model directly learns from test data without acc...

Vanshaj Khattar, Md Rafi ur Rashid, Moumita Choudhury, Jing Liu, Toshiaki Koike-Akino, Ming Jin, ...

2603.15417 2026-03-16
TESTING

SEA-Vision: A Multilingual Benchmark for Comprehensive Document and Scene Text Understanding in Southeast Asia

Multilingual document and scene text understanding plays an important role in applications such as search, finance, and public services. However, most existing benchmarks focus on high-resource lan...

Pengfei Yue, Xingran Zhao, Juntao Chen, Peng Hou, Wang Longchao, Jianghang Lin, Shengchuan Zhang,...

2603.15409 2026-03-16
AI LLM

TrinityGuard: A Unified Framework for Safeguarding Multi-Agent Systems

With the rapid development of LLM-based multi-agent systems (MAS), significant safety and security concerns have emerged that introduce novel risks beyond those of single agents or LLMs. Despi...

Kai Wang, Biaojie Zeng, Zeming Wei, Chang Jin, Hefeng Zhou, Xiangtian Li, Chao Yang, Jingjing Qu,...

2603.15408 2026-03-16
AI LLM

Fusian: Multi-LoRA Fusion for Fine-Grained Continuous MBTI Personality Control in Large Language Models

Large Language Models (LLMs) have demonstrated impressive capabilities in simulating diverse human behaviors and personalities. However, existing methods for personality control, which include prom...

Zehao Chen, Rong Pan

2603.15405 2026-03-16
AI LLM

A Closer Look into LLMs for Table Understanding

Despite the success of Large Language Models (LLMs) in table understanding, their internal mechanisms remain unclear. In this paper, we conduct an empirical study on 16 LLMs, covering general LLMs,...

Jia Wang, Chuanyu Qin, Mingyu Zheng, Qingyi Si, Peize Li, Zheng Lin

2603.15402 2026-03-16
AI LLM

SWE-Skills-Bench: Do Agent Skills Actually Help in Real-World Software Engineering?

Agent skills, structured procedural knowledge packages injected at inference time, are increasingly used to augment LLM agents on software engineering tasks. However, their real utility in end-to-e...

Tingxu Han, Yi Zhang, Wei Song, Chunrong Fang, Zhenyu Chen, Youcheng Sun, Lijie Hu

2603.15401 2026-03-16
TESTING

Spatial Characterization of Sub-Synchronous Oscillations Using Black-Box IBR Models

Power systems with high penetration of inverter-based resources (IBRs) are prone to sub-synchronous oscillations (SSO). The opaqueness of vendor-specific IBR models limits the ability to predict th...

Muhammad Sharjeel Javaid, Gabriel Covarrubias Maureira, Ambuj Gupta, Debraj Bhattacharjee, Jianli...

2603.15399 2026-03-16
AI LLM

SFCoT: Safer Chain-of-Thought via Active Safety Evaluation and Calibration

Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. However, they remain highly susceptible to jailbreak attacks that undermine their safety alignment...

Yu Pan, Wenlong Yu, Tiejun Wu, Xiaohu Ye, Qiannan Si, Guangquan Xu, Bin Wu

2603.15397 2026-03-16
AI LLM

AI Evasion and Impersonation Attacks on Facial Re-Identification with Activation Map Explanations

Facial identification systems are increasingly deployed in surveillance, yet their vulnerability to adversarial evasion and impersonation attacks poses a critical risk. This paper introduces a no...

Noe Claudel, Weisi Guo, Yang Xing

2603.15396 2026-03-16
AI LLM

When Does Sparsity Mitigate the Curse of Depth in LLMs

Recent work has demonstrated the curse of depth in large language models (LLMs), where later layers contribute less to learning and representation than earlier layers. Such under-utilization is lin...

Dilxat Muhtar, Xinyuan Song, Sebastian Pokutta, Max Zimmer, Nico Pelleriti, Thomas Hofmann, Shiwe...

2603.15389 2026-03-16
AI LLM

RieMind: Geometry-Grounded Spatial Agent for Scene Understanding

Visual Language Models (VLMs) have increasingly become the main paradigm for understanding indoor scenes, but they still struggle with metric and spatial reasoning. Current approaches rely on end-t...

Fernando Ropero, Erkin Turkoz, Daniel Matos, Junqing Du, Antonio Ruiz, Yanfeng Zhang, Lu Liu, Min...

2603.15386 2026-03-16
AI LLM

Why AI systems don't learn and what to do about it: Lessons on autonomous learning from cognitive science

We critically examine the limitations of current AI models in achieving autonomous learning and propose a learning architecture inspired by human and animal cognition. The proposed framework integr...

Emmanuel Dupoux, Yann LeCun, Jitendra Malik

2603.15381 2026-03-16
AI LLM

More Test-Time Compute Can Hurt: Overestimation Bias in LLM Beam Search

Wider beam search should improve LLM reasoning, but when should you stop widening? Prior work on beam width selection has focused on inference efficiency (qin2025dsbd; freitag2017beam), witho...

Gal Dalal, Assaf Hallak, Gal Chechik, Yftach Ziser

2603.15377 2026-03-16
AI LLM

Formalizing and validating properties in Asmeta with Large Language Models (Extended Abstract)

Writing temporal logic properties is often a challenging task for users of model-based development frameworks, particularly when translating informal requirements into formal specifications. In thi...

Andrea Bombarda, Silvia Bonfanti, Angelo Gargantini, Nico Pellegrinelli

2603.15375 2026-03-16
AI LLM

GradCFA: A Hybrid Gradient-Based Counterfactual and Feature Attribution Explanation Algorithm for Local Interpretation of Neural Networks

Explainable Artificial Intelligence (XAI) is increasingly essential as AI systems are deployed in critical fields such as healthcare and finance, offering transparency into AI-driven decisions. Two...

Jacob Sanderson, Hua Mao, Wai Lok Woo

2603.15373 2026-03-16
AI LLM

SKILLS: Structured Knowledge Injection for LLM-Driven Telecommunications Operations

As telecommunications operators accelerate adoption of AI-enabled automation, a practical question remains unresolved: can general-purpose large language model (LLM) agents reliably execute telecom...

Ivo Brett

2603.15372 2026-03-16
AI LLM

Brain-Inspired Graph Multi-Agent Systems for LLM Reasoning

Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of language tasks, yet complex multi-step reasoning remains a fundamental challenge. While Large Reasoning...

Guangfu Hao, Yuming Dai, Xianzhe Qin, Shan Yu

2603.15371 2026-03-16
AI LLM

CRASH: Cognitive Reasoning Agent for Safety Hazards in Autonomous Driving

As autonomous vehicles (AVs) grow in complexity and diversity, identifying the root causes of operational failures has become increasingly difficult. The heterogeneity of system architectures across manufacturers, ranging...

Erick Silva, Rehana Yasmin, Ali Shoker

2603.15364 2026-03-16
AI LLM

PMAx: An Agentic Framework for AI-Driven Process Mining

Process mining provides powerful insights into organizational workflows, but extracting these insights typically requires expertise in specialized query languages and data science tools. Large Lang...

Anton Antonov, Humam Kourani, Alessandro Berti, Gyunam Park, Wil M. P. van der Aalst

2603.15351 2026-03-16