Papers
Research papers from arXiv and related sources
Highly Autonomous Cyber-Capable Agents: Anticipating Capabilities, Tactics, and Strategic Implications
This report introduces the concept of "Highly Autonomous Cyber-Capable Agents" (HACCAs), AI systems capable of autonomously conducting multi-stage cyber campaigns at a level comparable to today's t...
Jam Kraprayoon, Shaun Ee, Brianna Rosen, Yohan Matthew, Aditya Singh, Christopher Covino, Asher B...
Managing Cognitive Bias in Human Labeling Operations for Rare-Event AI: Evidence from a Field Experiment
Many operational AI systems depend on large-scale human annotation to detect rare but consequential events (e.g., fraud, defects, and medical abnormalities). When positives are rare, the prevalence...
Gunnar P. Epping, Andrew Caplin, Erik Duhaime, William R. Holmes, Daniel Martin, Jennifer S. True...
Tiny Aya: Bridging Scale and Multilingual Depth
Tiny Aya redefines what a small multilingual language model can achieve. Trained on 70 languages and refined through region-aware posttraining, it delivers state-of-the-art in translation quality, ...
Alejandro R. Salamanca, Diana Abagyan, Daniel D'souza, Ammar Khairi, David Mora, Saurabh Dash, Vi...
KEPo: Knowledge Evolution Poison on Graph-based Retrieval-Augmented Generation
Graph-based Retrieval-Augmented Generation (GraphRAG) constructs the Knowledge Graph (KG) from external databases to enhance the timeliness and accuracy of Large Language Model (LLM) generations.Ho...
Qizhi Chen, Chao Qi, Yihong Huang, Muquan Li, Rongzheng Wang, Dongyang Zhang, Ke Qin, Shuang Liang
Try, Check and Retry: A Divide-and-Conquer Framework for Boosting Long-context Tool-Calling Performance of LLMs
Tool-calling empowers Large Language Models (LLMs) to interact with external environments. However, current methods often struggle to handle massive and noisy candidate tools in long-context tool-c...
Kunfeng Chen, Qihuang Zhong, Juhua Liu, Bo Du, Dacheng Tao
PRMB: Benchmarking Reward Models in Long-Horizon CBT-based Counseling Dialogue
Large language models (LLMs) hold potential for mental healthcare applications, particularly in cognitive behavioral therapy (CBT)-based counseling, where reward models play a critical role in alig...
Yougen Zhou, Qin Chen, Ningning Zhou, Jie Zhou, Liang He
INFACT: A Diagnostic Benchmark for Induced Faithfulness and Factuality Hallucinations in Video-LLMs
Despite rapid progress, Video Large Language Models (Video-LLMs) remain unreliable due to hallucinations, which are outputs that contradict either video evidence (faithfulness) or verifiable world ...
Junqi Yang, Yuecong Min, Jie Zhang, Shiguang Shan, Xilin Chen
Grammar of the Wave: Towards Explainable Multivariate Time Series Event Detection via Neuro-Symbolic VLM Agents
Time Series Event Detection (TSED) has long been an important task with critical applications across many high-stakes domains. Unlike statistical anomalies, events are defined by semantics with com...
Sky Chenwei Wan, Tianjun Hou, Yifei Wang, Xiqing Chang, Aymeric Jan
Deep Learning Network-Temporal Models For Traffic Prediction
Time series analysis is critical for emerging net- work intelligent control and management functions. However, existing statistical-based and shallow machine learning models have shown limited pred...
Yufeng Xin, Ethan Fan
COMIC: Agentic Sketch Comedy Generation
We propose a fully automated AI system that produces short comedic videos similar to sketch shows such as Saturday Night Live. Starting with character references, the system employs a population of...
Susung Hong, Brian Curless, Ira Kemelmacher-Shlizerman, Steve Seitz
Chasing RATs: Tracing Reading for and as Creative Activity
Creativity research has privileged making over the interpretive labor that precedes and shapes it. We introduce Reading Activity Traces (RATs), a proposal that treats reading -- broadly defined to ...
Sophia Liu, Shm Garanganao Almeda
Beyond the Illusion of Consensus: From Surface Heuristics to Knowledge-Grounded Evaluation in LLM-as-a-Judge
The paradigm of LLM-as-a-judge relies on a critical assumption, namely that high inter-evaluator agreement indicates reliable and objective evaluation. We present two complementary findings that ch...
Mingyang Song, Mao Zheng, Chenning Xu
LLMGreenRec: LLM-Based Multi-Agent Recommender System for Sustainable E-Commerce
Rising environmental awareness in e-commerce necessitates recommender systems that not only guide users to sustainable products but also minimize their own digital carbon footprints. Traditional se...
Hao N. Nguyen, Hieu M. Nguyen, Son Van Nguyen, Nguyen Thi Hanh
Does AI See like Art Historians? Interpreting How Vision Language Models Recognize Artistic Style
VLMs have become increasingly proficient at a range of computer vision tasks, such as visual question answering and object detection. This includes increasingly strong capabilities in the domain of...
Marvin Limpijankit, Milad Alshomary, Yassin Oulad Daoud, Amith Ananthram, Tim Trombley, Elias Ste...
Leech Lattice Vector Quantization for Efficient LLM Compression
Scalar quantization of large language models (LLMs) is fundamentally limited by information-theoretic bounds. While vector quantization (VQ) overcomes these limits by encoding blocks of parameters ...
Tycho F. A. van der Ouderaa, Mart van Baalen, Paul Whatmough, Markus Nagel
Task-Aware Delegation Cues for LLM Agents
LLM agents increasingly present as conversational collaborators, yet human--agent teamwork remains brittle due to information asymmetry: users lack task-specific reliability cues, and agents rarely...
Xingrui Gu
A Systematic Study of Pseudo-Relevance Feedback with LLMs
Pseudo-relevance feedback (PRF) methods built on large language models (LLMs) can be organized along two key design dimensions: the feedback source, which is where the feedback text is derived from...
Nour Jedidi, Jimmy Lin
RCTs & Human Uplift Studies: Methodological Challenges and Practical Solutions for Frontier AI Evaluation
Human uplift studies - or studies that measure AI effects on human performance relative to a status quo, typically using randomized controlled trial (RCT) methodology - are increasingly used to inf...
Patricia Paskov, Kevin Wei, Shen Zhou Hong, Dan Bateyko, Xavier Roberts-Gaal, Carson Ezell, Gaili...
Artificial Intelligence as a Catalyst for Innovation in Software Engineering
The rapid evolution and inherent complexity of modern software requirements demand highly flexible and responsive development methodologies. While Agile frameworks have become the industry standard...
Carlos Alberto Fernández-y-Fernández, Jorge R. Aguilar-Cisneros
Too Vivid to Be Real? Benchmarking and Calibrating Generative Color Fidelity
Recent advances in text-to-image (T2I) generation have greatly improved visual quality, yet producing images that appear visually authentic to real-world photography remains challenging. This is pa...
Zhengyao Fang, Zexi Jia, Yijia Zhong, Pengcheng Luo, Jinchao Zhang, Guangming Lu, Jun Yu, Wenjie Pei