Research

Papers

Research papers from arXiv and related sources

Total: 4513, AI/LLM: 2483, Testing: 2030
AI LLM

Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation

The rapid advancement of Large Language Models (LLMs) has established standardized evaluation benchmarks as the primary instrument for model comparison. Yet, their reliability is increasingly quest...

Bogdan Kostić, Conor Fallon, Julian Risch, Alexander Löser

2602.17316 2026-02-19
AI LLM

Open Datasets in Learning Analytics: Trends, Challenges, and Best PRACTICE

Open datasets play a crucial role in three research domains that intersect data science and education: learning analytics, educational data mining, and artificial intelligence in education. Researc...

Valdemar Švábenský, Brendan Flanagan, Erwin Daniel López Zapata, Atsushi Shimada

2602.17314 2026-02-19
AI LLM

MedClarify: An information-seeking AI agent for medical diagnosis with case-specific follow-up questions

Large language models (LLMs) are increasingly used for diagnostic tasks in medicine. In clinical practice, the correct diagnosis can rarely be immediately inferred from the initial patient presenta...

Hui Min Wong, Philip Heesen, Pascal Janetzky, Martin Bendszus, Stefan Feuerriegel

2602.17308 2026-02-19
AI LLM

Human attribution of empathic behaviour to AI systems

Artificial intelligence systems increasingly generate text intended to provide social and emotional support. Understanding how users perceive empathic qualities in such content is therefore critica...

Jonas Festor, Ivo Snels, Bennett Kleinberg

2602.17293 2026-02-19
AI LLM

Towards Cross-lingual Values Assessment: A Consensus-Pluralism Perspective

While large language models (LLMs) have become pivotal to content safety, current evaluation paradigms primarily focus on detecting explicit harms (e.g., violence or hate speech), neglecting the su...

Yukun Chen, Xinyu Zhang, Jialong Tang, Yu Wan, Baosong Yang, Yiming Li, Zhan Qin, Kui Ren

2602.17283 2026-02-19
AI LLM

Federated Latent Space Alignment for Multi-user Semantic Communications

Semantic communication aims to convey meaning for effective task execution, but differing latent representations in AI-native devices can cause semantic mismatches that hinder mutual understanding....

Giuseppe Di Poce, Mario Edoardo Pandolfo, Emilio Calvanese Strinati, Paolo Di Lorenzo

2602.17271 2026-02-19
AI LLM

On the Reliability of User-Centric Evaluation of Conversational Recommender Systems

User-centric evaluation has become a key paradigm for assessing Conversational Recommender Systems (CRS), aiming to capture subjective qualities such as satisfaction, trust, and rapport. To enable ...

Michael Müller, Amir Reza Mohammadi, Andreas Peintner, Beatriz Barroso Gstrein, Günther Specht, E...

2602.17264 2026-02-19
AI LLM

Quantifying and Mitigating Socially Desirable Responding in LLMs: A Desirability-Matched Graded Forced-Choice Psychometric Study

Human self-report questionnaires are increasingly used in NLP to benchmark and audit large language models (LLMs), from persona consistency to safety and bias assessments. Yet these instruments pre...

Kensuke Okada, Yui Furukawa, Kyosuke Bunji

2602.17262 2026-02-19
AI LLM

EA-Swin: An Embedding-Agnostic Swin Transformer for AI-Generated Video Detection

Recent advances in foundation video generators such as Sora2, Veo3, and other commercial systems have produced highly realistic synthetic videos, exposing the limitations of existing detection meth...

Hung Mai, Loi Dinh, Duc Hai Nguyen, Dat Do, Luong Doan, Khanh Nguyen Quoc, Huan Vu, Phong Ho, Nae...

2602.17260 2026-02-19
AI LLM

On the Concept of Violence: A Comparative Study of Human and AI Judgments

Background: What counts as violence is neither self-evident nor universally agreed upon. While physical aggression is prototypical, contemporary societies increasingly debate whether exclusion, hum...

Mariachiara Stellato, Francesco Lancia, Chiara Galeazzi, Nico Curti

2602.17256 2026-02-19
AI LLM

Web Verbs: Typed Abstractions for Reliable Task Composition on the Agentic Web

The Web is evolving from a medium that humans browse to an environment where software agents act on behalf of users. Advances in large language models (LLMs) make natural language a practical inter...

Linxi Jiang, Rui Xi, Zhijie Liu, Shuo Chen, Zhiqiang Lin, Suman Nath

2602.17245 2026-02-19
AI LLM

All Leaks Count, Some Count More: Interpretable Temporal Contamination Detection in LLM Backtesting

To evaluate whether LLMs can accurately predict future events, we need the ability to backtest them on events that have already resolved. This requires models to reason only with informati...

Zeyu Zhang, Ryan Chen, Bradly C. Stadie

2602.17234 2026-02-19
AI LLM

Mechanistic Interpretability of Cognitive Complexity in LLMs via Linear Probing using Bloom's Taxonomy

The black-box nature of Large Language Models necessitates novel evaluation frameworks that transcend surface-level performance metrics. This study investigates the internal neural representations ...

Bianca Raimondi, Maurizio Gabbrielli

2602.17229 2026-02-19
AI LLM

Privacy-Preserving Mechanisms Enable Cheap Verifiable Inference of LLMs

As large language models (LLMs) continue to grow in size, fewer users are able to host and run models locally. This has led to increased use of third-party hosting services. However, in this settin...

Arka Pal, Louai Zahran, William Gvozdjak, Akilesh Potti, Micah Goldblum

2602.17223 2026-02-19
AI LLM

Decoding the Human Factor: High Fidelity Behavioral Prediction for Strategic Foresight

Predicting human decision-making in high-stakes environments remains a central challenge for artificial intelligence. While large language models (LLMs) demonstrate strong general reasoning, they o...

Ben Yellin, Ehud Ezra, Mark Foreman, Shula Grinapol

2602.17222 2026-02-19
AI LLM

From Labor to Collaboration: A Methodological Experiment Using AI Agents to Augment Research Perspectives in Taiwan's Humanities and Social Sciences

Generative AI is reshaping knowledge work, yet existing research focuses predominantly on software engineering and the natural sciences, with limited methodological exploration for the humanities a...

Yi-Chih Huang

2602.17221 2026-02-19
AI LLM

NotebookRAG: Retrieving Multiple Notebooks to Augment the Generation of EDA Notebooks for Crowd-Wisdom

High-quality exploratory data analysis (EDA) is essential in the data science pipeline, but remains highly dependent on analysts' expertise and effort. While recent LLM-based approaches partially r...

Yi Shan, Yixuan He, Zekai Shao, Kai Xu, Siming Chen

2602.17215 2026-02-19
AI LLM

Extending quantum theory with AI-assisted deterministic game theory

We present an AI-assisted framework for predicting individual runs of complex quantum experiments, including contextuality and causality (adaptive measurements), within our long-term programme of d...

Florian Pauschitz, Ben Moseley, Ghislain Fourny

2602.17213 2026-02-19
AI LLM

Algorithmic Collusion at Test Time: A Meta-game Design and Evaluation

The threat of algorithmic collusion, and whether it merits regulatory intervention, remains debated, as existing evaluations of its emergence often rely on long learning horizons, assumptions about...

Yuhong Luo, Daniel Schoepflin, Xintong Wang

2602.17203 2026-02-19
AI LLM

The Case for HTML First Web Development

Since its introduction in the early 90s, the web has become the largest application platform available globally. HyperText Markup Language (HTML) has been an essential part of the web since the beg...

Juho Vepsäläinen

2602.17193 2026-02-19