Papers
Research papers from arXiv and related sources
VAUQ: Vision-Aware Uncertainty Quantification for LVLM Self-Evaluation
Large Vision-Language Models (LVLMs) frequently hallucinate, limiting their safe deployment in real-world applications. Existing LLM self-evaluation methods rely on a model's ability to estimate th...
Seongheon Park, Changdae Oh, Hyeong Kyu Choi, Xuefeng Du, Sharon Li
PaperTrail: A Claim-Evidence Interface for Grounding Provenance in LLM-based Scholarly Q&A
Large language models (LLMs) are increasingly used in scholarly question-answering (QA) systems to help researchers synthesize vast amounts of literature. However, these systems often produce subtl...
Anna Martin-Boyle, Cara A. C. Leckey, Martha C. Brown, Harmanpreet Kaur
LogicGraph: Benchmarking Multi-Path Logical Reasoning via Neuro-Symbolic Generation and Verification
Evaluations of large language models (LLMs) primarily emphasize convergent logical reasoning, where success is defined by producing a single correct proof. However, many real-world reasoning proble...
Yanrui Wu, Lingling Zhang, Xinyu Zhang, Jiayu Chang, Pengyu Li, Xu Jiang, Jingtao Hu, Jun Liu
Is Multi-Distribution Learning as Easy as PAC Learning: Sharp Rates with Bounded Label Noise
Towards understanding the statistical complexity of learning from heterogeneous sources, we study the problem of multi-distribution learning. Given $k$ data sources, the goal is to output a classif...
Rafael Hanashiro, Abhishek Shetty, Patrick Jaillet
Empirically Calibrated Conditional Independence Tests
Conditional independence tests (CIT) are widely used for causal discovery and feature selection. Even with false discovery rate (FDR) control procedures, they often fail to provide frequentist guar...
Milleno Pan, Antoine de Mathelin, Wesley Tansey
From Perception to Action: An Interactive Benchmark for Vision Reasoning
Understanding the physical structure is essential for real-world applications such as embodied agents, interactive design, and long-horizon manipulation. Yet, prevailing Vision-Language Model (VLM)...
Yuhao Wu, Maojia Song, Yihuai Lan, Lei Wang, Zhiqiang Hu, Yao Xiao, Heng Zhou, Weihua Zheng, Dyla...
International AI Safety Report 2026
The International AI Safety Report 2026 synthesises the current scientific evidence on the capabilities, emerging risks, and safety of general-purpose AI systems. The report series was mandated by ...
Yoshua Bengio, Stephen Clare, Carina Prunkl, Maksym Andriushchenko, Ben Bucknall, Malcolm Murray,...
HiSAC: Hierarchical Sparse Activation Compression for Ultra-long Sequence Modeling in Recommenders
Modern recommender systems leverage ultra-long user behavior sequences to capture dynamic preferences, but end-to-end modeling is infeasible in production due to latency and memory constraints. Whi...
Kun Yuan, Junyu Bi, Daixuan Cheng, Changfa Wu, Shuwen Xiao, Binbin Cao, Jian Wu, Yuning Jiang
HINORA II: Testing the Existence of the Council of Giants in ΛCDM simulations
The discovery of the galaxy ring known as the Council of Giants (CoG) highlights the need to explain such structures in the Local Universe. In the first paper of this series we presented HINORA - a...
Edward Olex, Alexander Knebe, Noam I. Libeskind, Stefan Gottlöber, Dmitry I. Makarov
Constraints on dynamically-formed massive black holes in Little Red Dots from X-ray non-detections
The existence of massive, compact galaxies (Little Red Dots, LRDs) at $z \sim 2$ challenges early structure formation models, suggesting rapid stellar and black hole (BH) assembly. While LRDs are e...
M. Liempi, D. R. G. Schleicher, M. A. Latif, R. Schneider, F. Flammini Dotti, A. Escala, M. C. Ve...
VII: Visual Instruction Injection for Jailbreaking Image-to-Video Generation Models
Image-to-Video (I2V) generation models, which condition video generation on reference images, have shown emerging visual instruction-following capability, allowing certain visual cues in reference ...
Bowen Zheng, Yongli Xiang, Ziming Hong, Zerong Lin, Chaojian Yu, Tongliang Liu, Xinge You
Characterization-free classification and identification of the environment between two quantum players
Classifying the causal structure of quantum channels is essential for verifying quantum networks and certifying quantum resources. We introduce a characterization-free protocol enabling two isolate...
Masahito Hayashi, Longyang Cao, Baichu Yu, Yuan-Yuan Zhao
Generative Pseudo-Labeling for Pre-Ranking with LLMs
Pre-ranking is a critical stage in industrial recommendation systems, tasked with efficiently scoring thousands of recalled items for downstream ranking. A key challenge is the train-serving discre...
Junyu Bi, Xinting Niu, Daixuan Cheng, Kun Yuan, Tao Wang, Binbin Cao, Jian Wu, Yuning Jiang
Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models
Scaling multimodal alignment between video and audio is challenging, particularly due to limited data and the mismatch between text descriptions and frame-level video information. In this work, we ...
Christian Simon, Masato Ishii, Wei-Yao Wang, Koichi Saito, Akio Hayakawa, Dongseok Shim, Zhi Zhon...
Toward an Agentic Infused Software Ecosystem
Fully leveraging the capabilities of AI agents in software development requires a rethinking of the software ecosystem itself. To this end, this paper outlines the creation of an Agentic Infused So...
Mark Marron
Evaluating Proactive Risk Awareness of Large Language Models
As large language models (LLMs) are increasingly embedded in everyday decision-making, their safety responsibilities extend beyond reacting to explicit harmful intent toward anticipating unintended...
Xuan Luo, Yubin Chen, Zhiyu Hou, Linpu Yu, Geng Tu, Jing Li, Ruifeng Xu
Linear Reasoning vs. Proof by Cases: Obstacles for Large Language Models in FOL Problem Solving
To comprehensively evaluate the mathematical reasoning capabilities of Large Language Models (LLMs), researchers have introduced abundant mathematical reasoning datasets. However, most existing dat...
Yuliang Ji, Fuchen Shen, Jian Wu, Qiujie Xie, Yue Zhang
Are Multimodal Large Language Models Good Annotators for Image Tagging?
Image tagging, a fundamental vision task, traditionally relies on human-annotated datasets to train multi-label classifiers, which incurs significant labor and costs. While Multimodal Large Languag...
Ming-Kun Xie, Jia-Hao Xiao, Zhiqiang Kou, Zhongnian Li, Gang Niu, Masashi Sugiyama
Does Order Matter: Connecting The Law of Robustness to Robust Generalization
Bubeck and Sellke (2021) pose as an open problem the connection between the law of robustness and robust generalization. The law of robustness states that overparameterization is necessary for mode...
Himadri Mandal, Vishnu Varadarajan, Jaee Ponde, Aritra Das, Mihir More, Debayan Gupta
Blackbird Language Matrices: A Framework to Investigate the Linguistic Competence of Language Models
This article describes a novel language task, the Blackbird Language Matrices (BLM) task, inspired by intelligence tests, and illustrates the BLM datasets, their construction and benchmarking, and ...
Paola Merlo, Chunyang Jiang, Giuseppe Samo, Vivi Nastase