Papers
Research papers from arXiv and related sources
Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation
The rapid advancement of Large Language Models (LLMs) has established standardized evaluation benchmarks as the primary instrument for model comparison. Yet, their reliability is increasingly quest...
Bogdan Kostić, Conor Fallon, Julian Risch, Alexander Löser
Open Datasets in Learning Analytics: Trends, Challenges, and Best Practice
Open datasets play a crucial role in three research domains that intersect data science and education: learning analytics, educational data mining, and artificial intelligence in education. Researc...
Valdemar Švábenský, Brendan Flanagan, Erwin Daniel López Zapata, Atsushi Shimada
MedClarify: An information-seeking AI agent for medical diagnosis with case-specific follow-up questions
Large language models (LLMs) are increasingly used for diagnostic tasks in medicine. In clinical practice, the correct diagnosis can rarely be immediately inferred from the initial patient presenta...
Hui Min Wong, Philip Heesen, Pascal Janetzky, Martin Bendszus, Stefan Feuerriegel
Human attribution of empathic behaviour to AI systems
Artificial intelligence systems increasingly generate text intended to provide social and emotional support. Understanding how users perceive empathic qualities in such content is therefore critica...
Jonas Festor, Ivo Snels, Bennett Kleinberg
Towards Cross-lingual Values Assessment: A Consensus-Pluralism Perspective
While large language models (LLMs) have become pivotal to content safety, current evaluation paradigms primarily focus on detecting explicit harms (e.g., violence or hate speech), neglecting the su...
Yukun Chen, Xinyu Zhang, Jialong Tang, Yu Wan, Baosong Yang, Yiming Li, Zhan Qin, Kui Ren
Federated Latent Space Alignment for Multi-user Semantic Communications
Semantic communication aims to convey meaning for effective task execution, but differing latent representations in AI-native devices can cause semantic mismatches that hinder mutual understanding....
Giuseppe Di Poce, Mario Edoardo Pandolfo, Emilio Calvanese Strinati, Paolo Di Lorenzo
On the Reliability of User-Centric Evaluation of Conversational Recommender Systems
User-centric evaluation has become a key paradigm for assessing Conversational Recommender Systems (CRS), aiming to capture subjective qualities such as satisfaction, trust, and rapport. To enable ...
Michael Müller, Amir Reza Mohammadi, Andreas Peintner, Beatriz Barroso Gstrein, Günther Specht, E...
Quantifying and Mitigating Socially Desirable Responding in LLMs: A Desirability-Matched Graded Forced-Choice Psychometric Study
Human self-report questionnaires are increasingly used in NLP to benchmark and audit large language models (LLMs), from persona consistency to safety and bias assessments. Yet these instruments pre...
Kensuke Okada, Yui Furukawa, Kyosuke Bunji
EA-Swin: An Embedding-Agnostic Swin Transformer for AI-Generated Video Detection
Recent advances in foundation video generators such as Sora2, Veo3, and other commercial systems have produced highly realistic synthetic videos, exposing the limitations of existing detection meth...
Hung Mai, Loi Dinh, Duc Hai Nguyen, Dat Do, Luong Doan, Khanh Nguyen Quoc, Huan Vu, Phong Ho, Nae...
On the Concept of Violence: A Comparative Study of Human and AI Judgments
Background: What counts as violence is neither self-evident nor universally agreed upon. While physical aggression is prototypical, contemporary societies increasingly debate whether exclusion, hum...
Mariachiara Stellato, Francesco Lancia, Chiara Galeazzi, Nico Curti
Web Verbs: Typed Abstractions for Reliable Task Composition on the Agentic Web
The Web is evolving from a medium that humans browse to an environment where software agents act on behalf of users. Advances in large language models (LLMs) make natural language a practical inter...
Linxi Jiang, Rui Xi, Zhijie Liu, Shuo Chen, Zhiqiang Lin, Suman Nath
All Leaks Count, Some Count More: Interpretable Temporal Contamination Detection in LLM Backtesting
To evaluate whether LLMs can accurately predict future events, we need the ability to backtest them on events that have already resolved. This requires models to reason only with informati...
Zeyu Zhang, Ryan Chen, Bradly C. Stadie
Mechanistic Interpretability of Cognitive Complexity in LLMs via Linear Probing using Bloom's Taxonomy
The black-box nature of Large Language Models necessitates novel evaluation frameworks that transcend surface-level performance metrics. This study investigates the internal neural representations ...
Bianca Raimondi, Maurizio Gabbrielli
Privacy-Preserving Mechanisms Enable Cheap Verifiable Inference of LLMs
As large language models (LLMs) continue to grow in size, fewer users are able to host and run models locally. This has led to increased use of third-party hosting services. However, in this settin...
Arka Pal, Louai Zahran, William Gvozdjak, Akilesh Potti, Micah Goldblum
Decoding the Human Factor: High Fidelity Behavioral Prediction for Strategic Foresight
Predicting human decision-making in high-stakes environments remains a central challenge for artificial intelligence. While large language models (LLMs) demonstrate strong general reasoning, they o...
Ben Yellin, Ehud Ezra, Mark Foreman, Shula Grinapol
From Labor to Collaboration: A Methodological Experiment Using AI Agents to Augment Research Perspectives in Taiwan's Humanities and Social Sciences
Generative AI is reshaping knowledge work, yet existing research focuses predominantly on software engineering and the natural sciences, with limited methodological exploration for the humanities a...
Yi-Chih Huang
NotebookRAG: Retrieving Multiple Notebooks to Augment the Generation of EDA Notebooks for Crowd-Wisdom
High-quality exploratory data analysis (EDA) is essential in the data science pipeline, but remains highly dependent on analysts' expertise and effort. While recent LLM-based approaches partially r...
Yi Shan, Yixuan He, Zekai Shao, Kai Xu, Siming Chen
Extending quantum theory with AI-assisted deterministic game theory
We present an AI-assisted framework for predicting individual runs of complex quantum experiments, including contextuality and causality (adaptive measurements), within our long-term programme of d...
Florian Pauschitz, Ben Moseley, Ghislain Fourny
Algorithmic Collusion at Test Time: A Meta-game Design and Evaluation
The threat of algorithmic collusion, and whether it merits regulatory intervention, remains debated, as existing evaluations of its emergence often rely on long learning horizons, assumptions about...
Yuhong Luo, Daniel Schoepflin, Xintong Wang
The Case for HTML First Web Development
Since its introduction in the early 90s, the web has become the largest application platform available globally. HyperText Markup Language (HTML) has been an essential part of the web since the beg...
Juho Vepsäläinen