Research

Papers

Research papers from arXiv and related sources

Total: 4513 AI/LLM: 2483 Testing: 2030
AI LLM

Beyond the Star Rating: A Scalable Framework for Aspect-Based Sentiment Analysis Using LLMs and Text Classification

Customer-provided reviews have become an important source of information for business owners and other customers alike. However, effectively analyzing millions of unstructured reviews remains chall...

Vishal Patil, Shree Vaishnavi Bacha, Revanth Yamani, Yidan Sun, Mayank Kejriwal

2602.21082 2026-02-24
AI LLM

Tool Building as a Path to "Superintelligence"

The Diligent Learner framework suggests LLMs can achieve superintelligence via test-time search, provided a sufficient step-success probability $γ$. In this work, we design a benchmark to measure $...

David Koplow, Tomer Galanti, Tomaso Poggio

2602.21061 2026-02-24
AI LLM

An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems

Large Language Models (LLMs) are transforming scholarly tasks like search and summarization, but their reliability remains uncertain. Current evaluation metrics for testing LLM reliability are prim...

Anna Martin-Boyle, William Humphreys, Martha Brown, Cara Leckey, Harmanpreet Kaur

2602.21059 2026-02-24
AI LLM

VAUQ: Vision-Aware Uncertainty Quantification for LVLM Self-Evaluation

Large Vision-Language Models (LVLMs) frequently hallucinate, limiting their safe deployment in real-world applications. Existing LLM self-evaluation methods rely on a model's ability to estimate th...

Seongheon Park, Changdae Oh, Hyeong Kyu Choi, Xuefeng Du, Sharon Li

2602.21054 2026-02-24
AI LLM

PaperTrail: A Claim-Evidence Interface for Grounding Provenance in LLM-based Scholarly Q&A

Large language models (LLMs) are increasingly used in scholarly question-answering (QA) systems to help researchers synthesize vast amounts of literature. However, these systems often produce subtl...

Anna Martin-Boyle, Cara A. C. Leckey, Martha C. Brown, Harmanpreet Kaur

2602.21045 2026-02-24
AI LLM

LogicGraph : Benchmarking Multi-Path Logical Reasoning via Neuro-Symbolic Generation and Verification

Evaluations of large language models (LLMs) primarily emphasize convergent logical reasoning, where success is defined by producing a single correct proof. However, many real-world reasoning proble...

Yanrui Wu, Lingling Zhang, Xinyu Zhang, Jiayu Chang, Pengyu Li, Xu Jiang, Jingtao Hu, Jun Liu

2602.21044 2026-02-24
AI LLM

From Perception to Action: An Interactive Benchmark for Vision Reasoning

Understanding the physical structure is essential for real-world applications such as embodied agents, interactive design, and long-horizon manipulation. Yet, prevailing Vision-Language Model (VLM)...

Yuhao Wu, Maojia Song, Yihuai Lan, Lei Wang, Zhiqiang Hu, Yao Xiao, Heng Zhou, Weihua Zheng, Dyla...

2602.21015 2026-02-24
AI LLM

International AI Safety Report 2026

The International AI Safety Report 2026 synthesises the current scientific evidence on the capabilities, emerging risks, and safety of general-purpose AI systems. The report series was mandated by ...

Yoshua Bengio, Stephen Clare, Carina Prunkl, Maksym Andriushchenko, Ben Bucknall, Malcolm Murray,...

2602.21012 2026-02-24
AI LLM

VII: Visual Instruction Injection for Jailbreaking Image-to-Video Generation Models

Image-to-Video (I2V) generation models, which condition video generation on reference images, have shown emerging visual instruction-following capability, allowing certain visual cues in reference ...

Bowen Zheng, Yongli Xiang, Ziming Hong, Zerong Lin, Chaojian Yu, Tongliang Liu, Xinge You

2602.20999 2026-02-24
AI LLM

Generative Pseudo-Labeling for Pre-Ranking with LLMs

Pre-ranking is a critical stage in industrial recommendation systems, tasked with efficiently scoring thousands of recalled items for downstream ranking. A key challenge is the train-serving discre...

Junyu Bi, Xinting Niu, Daixuan Cheng, Kun Yuan, Tao Wang, Binbin Cao, Jian Wu, Yuning Jiang

2602.20995 2026-02-24
AI LLM

Toward an Agentic Infused Software Ecosystem

Fully leveraging the capabilities of AI agents in software development requires a rethinking of the software ecosystem itself. To this end, this paper outlines the creation of an Agentic Infused So...

Mark Marron

2602.20979 2026-02-24
AI LLM

Evaluating Proactive Risk Awareness of Large Language Models

As large language models (LLMs) are increasingly embedded in everyday decision-making, their safety responsibilities extend beyond reacting to explicit harmful intent toward anticipating unintended...

Xuan Luo, Yubin Chen, Zhiyu Hou, Linpu Yu, Geng Tu, Jing Li, Ruifeng Xu

2602.20976 2026-02-24
AI LLM

Linear Reasoning vs. Proof by Cases: Obstacles for Large Language Models in FOL Problem Solving

To comprehensively evaluate the mathematical reasoning capabilities of Large Language Models (LLMs), researchers have introduced abundant mathematical reasoning datasets. However, most existing dat...

Yuliang Ji, Fuchen Shen, Jian Wu, Qiujie Xie, Yue Zhang

2602.20973 2026-02-24
AI LLM

Are Multimodal Large Language Models Good Annotators for Image Tagging?

Image tagging, a fundamental vision task, traditionally relies on human-annotated datasets to train multi-label classifiers, which incurs significant labor and costs. While Multimodal Large Languag...

Ming-Kun Xie, Jia-Hao Xiao, Zhiqiang Kou, Zhongnian Li, Gang Niu, Masashi Sugiyama

2602.20972 2026-02-24
AI LLM

Blackbird Language Matrices: A Framework to Investigate the Linguistic Competence of Language Models

This article describes a novel language task, the Blackbird Language Matrices (BLM) task, inspired by intelligence tests, and illustrates the BLM datasets, their construction and benchmarking, and ...

Paola Merlo, Chunyang Jiang, Giuseppe Samo, Vivi Nastase

2602.20966 2026-02-24
AI LLM

See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis

Despite recent advances in diffusion models, AI generated images still often contain visual artifacts that compromise realism. Although more thorough pre-training and bigger models might reduce art...

Jaehyun Park, Minyoung Ahn, Minkyu Kim, Jonghyun Lee, Jae-Gil Lee, Dongmin Park

2602.20951 2026-02-24
AI LLM

Some Simple Economics of AGI

For millennia, human cognition was the primary engine of progress on Earth. As AI decouples cognition from biology, the marginal cost of measurable execution falls to zero, absorbing any labor capt...

Christian Catalini, Xiang Hui, Jane Wu

2602.20946 2026-02-24
AI LLM

The Art of Efficient Reasoning: Data, Reward, and Optimization

Large Language Models (LLMs) consistently benefit from scaled Chain-of-Thought (CoT) reasoning, but also suffer from heavy computational overhead. To address this issue, efficient reasoning aims to...

Taiqiang Wu, Zenan Zu, Bo Zhou, Ngai Wong

2602.20945 2026-02-24
AI LLM

Architecting AgentOS: From Token-Level Context to Emergent System-Level Intelligence

The paradigm of Large Language Models is undergoing a fundamental transition from static inference engines to dynamic autonomous cognitive systems.While current research primarily focuses on scalin...

ChengYou Li, XiaoDong Liu, XiangBao Meng, XinYu Zhao

2602.20934 2026-02-24
AI LLM

HELP: HyperNode Expansion and Logical Path-Guided Evidence Localization for Accurate and Efficient GraphRAG

Large Language Models (LLMs) often struggle with inherent knowledge boundaries and hallucinations, limiting their reliability in knowledge-intensive tasks. While Retrieval-Augmented Generation (RAG...

Yuqi Huang, Ning Liao, Kai Yang, Anning Hu, Shengchao Hu, Xiaoxing Wang, Junchi Yan

2602.20926 2026-02-24