Papers
Research papers from arXiv and related sources
TARAZ: Persian Short-Answer Question Benchmark for Cultural Evaluation of Language Models
This paper presents a comprehensive evaluation framework for assessing the cultural competence of large language models (LLMs) in Persian. Existing Persian cultural benchmarks rely predominantly on...
Reihaneh Iranmanesh, Saeedeh Davoudi, Pasha Abrishamchian, Ophir Frieder, Nazli Goharian
Face Time Traveller: Travel Through Ages Without Losing Identity
Face aging, an ill-posed problem shaped by environmental and genetic factors, is vital in entertainment, forensics, and digital archiving, where realistic age transformations must preserve both ide...
Purbayan Kar, Ayush Ghadiya, Vishal Chudasama, Pankaj Wasnik, C. V. Jawahar
When Should an AI Act? A Human-Centered Model of Scene, Context, and Behavior for Agentic AI Design
Agentic AI increasingly intervenes proactively by inferring users' situations from contextual data, yet it often fails because it lacks principled judgment about when, why, and whether to act. We address th...
Soyoung Jung, Daehoo Yoon, Sung Gyu Koh, Young Hwan Kim, Yehan Ahn, Sung Park
Accelerating Local LLMs on Resource-Constrained Edge Devices via Distributed Prompt Caching
Since local LLM inference on resource-constrained edge devices imposes a severe performance bottleneck, this paper proposes distributed prompt caching to enhance inference performance by cooperativ...
Hiroki Matsutani, Naoki Matsuda, Naoto Sugiura
PhotoAgent: Agentic Photo Editing with Exploratory Visual Aesthetic Planning
With the rapid recent development of generative models, instruction-based image editing has shown great potential for generating high-quality images. However, the quality of editing depends heavily on...
Mingde Yao, Zhiyuan You, Tam-King Man, Menglu Wang, Tianfan Xue
MiroFlow: Towards High-Performance and Robust Open-Source Agent Framework for General Deep Research Tasks
Despite the remarkable progress of large language models (LLMs), the capabilities of standalone LLMs have begun to plateau when tackling real-world, complex tasks that require interaction with exte...
Shiqian Su, Sen Xing, Xuan Dong, Muyan Zhong, Bin Wang, Xizhou Zhu, Yuntao Chen, Wenhai Wang, Yue...
Natural Language Declarative Prompting (NLD-P): A Modular Governance Method for Prompt Design Under Model Drift
The rapid evolution of large language models (LLMs) has transformed prompt engineering from a localized craft into a systems-level governance challenge. As models scale and update across generation...
Hyunwoo Kim, Hanau Yi, Jaehee Bae, Yumin Kim
Probing for Knowledge Attribution in Large Language Models
Large language models (LLMs) often generate fluent but unfounded claims, or hallucinations, which fall into two types: (i) faithfulness violations - misusing user context - and (ii) factuality viol...
Ivo Brink, Alexander Boer, Dennis Ulmer
ClinDet-Bench: Beyond Abstention, Evaluating Judgment Determinability of LLMs in Clinical Decision-Making
Clinical decisions are often required under incomplete information. Clinical experts must identify whether available information is sufficient for judgment, as both premature conclusion and unneces...
Yusuke Watanabe, Yohei Kobashi, Takeshi Kojima, Yusuke Iwasawa, Yasushi Okuno, Yutaka Matsuo
AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications
Large Language Models (LLMs) are deployed as autonomous agents in increasingly complex applications, where enabling long-horizon memory is critical for achieving strong performance. However, a sign...
Yujie Zhao, Boqin Yuan, Junbo Huang, Haocheng Yuan, Zhongming Yu, Haozhou Xu, Lanxiang Hu, Abhila...
Towards Better RL Training Data Utilization via Second-Order Rollout
Reinforcement Learning (RL) has empowered Large Language Models (LLMs) with strong reasoning capabilities, but vanilla RL mainly focuses on generation capability improvement by training with only f...
Zhe Yang, Yudong Wang, Rang Li, Zhifang Sui
Evaluating and Improving Automated Repository-Level Rust Issue Resolution with LLM-based Agents
The Rust programming language presents a steep learning curve and significant coding challenges, making the automation of issue resolution essential for its broader adoption. Recently, LLM-powered ...
Jiahong Xiang, Wenxiao He, Xihua Wang, Hongliang Tian, Yuqun Zhang
An AI-Based Structured Semantic Control Model for Stable and Coherent Dynamic Interactive Content Generation
This study addresses the challenge that generative models struggle to balance flexibility, stability, and controllability in complex interactive scenarios. It proposes a controllable generation fra...
Rui Liu
Distributed LLM Pretraining During Renewable Curtailment Windows: A Feasibility Study
Training large language models (LLMs) requires substantial compute and energy. At the same time, renewable energy sources regularly produce more electricity than the grid can absorb, leading to cur...
Philipp Wiesner, Soeren Becker, Brett Cornick, Dominik Scheinert, Alexander Acker, Odej Kao
Decomposing Physician Disagreement in HealthBench
We decompose physician disagreement in the HealthBench medical AI evaluation dataset to understand where variance resides and what observable features can explain it. Rubric identity accounts for 1...
Satya Borgohain, Roy Mariathas
AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors
We introduce AuditBench, an alignment auditing benchmark. AuditBench consists of 56 language models with implanted hidden behaviors. Each model has one of 14 concerning behaviors--such as sycophant...
Abhay Sheshadri, Aidan Ewart, Kai Fronsdal, Isha Gupta, Samuel R. Bowman, Sara Price, Samuel Mark...
Measurements of branching fractions of $\Lambda_{c}^{+}\to\Sigma^{0}K_{S}^{0}\pi^{+}$ and $\Lambda_{c}^{+}\to\Sigma^{0}K_{S}^{0}K^{+}$
Based on a data sample corresponding to an integrated luminosity of 6.4~fb$^{-1}$ of $e^+e^-$ annihilation and collected with the BESIII detector at 13 center-of-mass energy points ranging between ...
BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, X. C. Ai, C. S. Akondi, R. Alibert...
Towards Simulating Social Media Users with LLMs: Evaluating the Operational Validity of Conditioned Comment Prediction
The transition of Large Language Models (LLMs) from exploratory tools to active "silicon subjects" in social science lacks extensive validation of operational validity. This study introduces Condit...
Nils Schwager, Simon Münker, Alistair Plum, Achim Rettinger
The Inference Bottleneck: Antitrust and Neutrality Duties in the Age of Cognitive Infrastructure
As generative AI commercializes, competitive advantage is shifting from one-time model training toward continuous inference, distribution, and routing. At the frontier, large-scale inference can fu...
Gaston Besanson, Marcelo Celani
SPATIALALIGN: Aligning Dynamic Spatial Relationships in Video Generation
Most text-to-video (T2V) generators prioritize aesthetic quality but often ignore the spatial constraints in the generated videos. In this work, we present SPATIALALIGN, a self-improvement frame...
Fengming Liu, Tat-Jen Cham, Chuanxia Zheng