Papers
Research papers from arXiv and related sources
CAST-TTS: A Simple Cross-Attention Framework for Unified Timbre Control in TTS
Current Text-to-Speech (TTS) systems typically use separate models for speech-prompted and text-prompted timbre control. While unifying both control signals into a single model is desirable, the ch...
Zihao Zheng, Wen Wu, Chao Zhang, Mengyue Wu, Xuenan Xu
Adaptive Theory of Mind for LLM-based Multi-Agent Coordination
Theory of Mind (ToM) refers to the ability to reason about others' mental states, and higher-order ToM involves considering that others also possess their own ToM. Equipping large language model (L...
Chunjiang Mu, Ya Zeng, Qiaosheng Zhang, Kun Shao, Chen Chu, Hao Guo, Danyang Jia, Zhen Wang, Shuy...
Human/AI Collective Intelligence for Deliberative Democracy: A Human-Centred Design Approach
This chapter introduces the concept of Collective Intelligence for Deliberative Democracy (CI4DD). We propose that the use of computational tools, specifically artificial intelligence to advance de...
Anna De Liddo, Lucas Anastasiou, Simon Buckingham Shum
When Thinking Hurts: Mitigating Visual Forgetting in Video Reasoning via Frame Repetition
Recently, Multimodal Large Language Models (MLLMs) have demonstrated significant potential in complex visual tasks through the integration of Chain-of-Thought (CoT) reasoning. However, in Video Que...
Xiaokun Sun, Yubo Wang, Haoyu Cao, Linli Xu
Grounding the Score: Explicit Visual Premise Verification for Reliable Vision-Language Process Reward Models
Vision-language process reward models (VL-PRMs) are increasingly used to score intermediate reasoning steps and rerank candidates under test-time scaling. However, they often function as black-box ...
Junxin Wang, Dai Guan, Weijie Qiu, Zhihang Li, Yongbo Gai, Zhengyi Yang, Mengyu Zhou, Erchao Zhao...
Visual Prompt Discovery via Semantic Exploration
LVLMs encounter significant challenges in image understanding and visual reasoning, leading to critical perception failures. Visual prompts, which incorporate image manipulation code, have shown pr...
Jaechang Kim, Yotaro Shimose, Zhao Wang, Kuang-Da Wang, Jungseul Ok, Shingo Takamatsu
How to Utilize Complementary Vision-Text Information for 2D Structure Understanding
LLMs typically linearize 2D tables into 1D sequences to fit their autoregressive architecture, which weakens row-column adjacency and other layout cues. In contrast, purely visual encoders can capt...
Jiancheng Dong, Pengyue Jia, Derong Xu, Jiawei Cheng, Jingyu Peng, Chao Zhang, Bowen Liu, Xin Sun...
More Rounds, More Noise: Why Multi-Turn Review Fails to Improve Cross-Context Verification
Cross-Context Review (CCR) improves LLM verification by separating production and review into independent sessions. A natural extension is multi-turn review: letting the reviewer ask follow-up ques...
Song Tae-Eun
Mixture-of-Depths Attention
Scaling depth is a key driver for large language models (LLMs). Yet, as LLMs become deeper, they often suffer from signal degradation: informative features formed in shallow layers are gradually di...
Lianghui Zhu, Yuxin Fang, Bencheng Liao, Shijie Wang, Tianheng Cheng, Zilong Huang, Chen Chen, La...
Look Before Acting: Enhancing Vision Foundation Representations for Vision-Language-Action Models
Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for robotic manipulation, in which reliable action prediction critically depends on accurately interpreting and int...
Yulin Luo, Hao Chen, Zhuangzhe Wu, Bowen Sui, Jiaming Liu, Chenyang Gu, Zhuoyang Liu, Qiuxuan Fen...
HorizonMath: Measuring AI Progress Toward Mathematical Discovery with Automatic Verification
Can AI make progress on important, unsolved mathematical problems? Large language models are now capable of sophisticated mathematical and scientific reasoning, but whether they can perform novel r...
Erik Y. Wang, Sumeet Motwani, James V. Roggeveen, Eliot Hodges, Dulhan Jayalath, Charles London, ...
Mechanistic Origin of Moral Indifference in Language Models
Existing behavioral alignment techniques for Large Language Models (LLMs) often neglect the discrepancy between surface compliance and internal unaligned representations, leaving LLMs vulnerable to...
Lingyu Li, Yan Teng, Yingchun Wang
Tri-Prompting: Video Diffusion with Unified Control over Scene, Subject, and Motion
Recent video diffusion models have made remarkable strides in visual quality, yet precise, fine-grained control remains a key bottleneck that limits practical customizability for content creation. ...
Zhenghong Zhou, Xiaohang Zhan, Zhiqin Chen, Soo Ye Kim, Nanxuan Zhao, Haitian Zheng, Qing Liu, He...
HSImul3R: Physics-in-the-Loop Reconstruction of Simulation-Ready Human-Scene Interactions
We present HSImul3R, a unified framework for simulation-ready 3D reconstruction of human-scene interactions (HSI) from casual captures, including sparse-view images and monocular videos. Existing m...
Yukang Cao, Haozhe Xie, Fangzhou Hong, Long Zhuo, Zhaoxi Chen, Liang Pan, Ziwei Liu
Code-A1: Adversarial Evolving of Code LLM and Test LLM via Reinforcement Learning
Reinforcement learning for code generation relies on verifiable rewards from unit test pass rates. Yet high-quality test suites are scarce, existing datasets offer limited coverage, and static rewa...
Aozhe Wang, Yuchen Yan, Nan Zhou, Zhengxi Lu, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen
SmartSearch: How Ranking Beats Structure for Conversational Memory Retrieval
Recent conversational memory systems invest heavily in LLM-based structuring at ingestion time and learned retrieval policies at query time. We show that neither is necessary. SmartSearch retrieves...
Jesper Derehag, Carlos Calva, Timmy Ghiurau
AC-Foley: Reference-Audio-Guided Video-to-Audio Synthesis with Acoustic Transfer
Existing video-to-audio (V2A) generation methods predominantly rely on text prompts alongside visual information to synthesize audio. However, two critical bottlenecks persist: semantic granularity...
Pengjun Fang, Yingqing He, Yazhou Xing, Qifeng Chen, Ser-Nam Lim, Harry Yang
OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data
Deep search capabilities have become an indispensable competency for frontier Large Language Model (LLM) agents, yet the development of high-performance search agents remains dominated by industria...
Yuwen Du, Rui Ye, Shuo Tang, Xinyu Zhu, Yijun Lu, Yuzhu Cai, Siheng Chen
Effective Distillation to Hybrid xLSTM Architectures
There have been numerous attempts to distill quadratic attention-based large language models (LLMs) into sub-quadratic linearized architectures. However, despite extensive research, such distilled ...
Lukas Hauzenberger, Niklas Schmidinger, Thomas Schmied, Anamaria-Roberta Hartl, David Stap, Piete...
LEXI: Lossless Exponent Coding for Efficient Inter-Chiplet Communication in Hybrid LLMs
Data movement overheads increase the inference latency of state-of-the-art large language models (LLMs). These models commonly use the bfloat16 (BF16) format for stable training. Floating-point sta...
Miao Sun, Alish Kanani, Kaushik Shroff, Umit Ogras