Paper
The Curse and Blessing of Mean Bias in FP4-Quantized LLM Training
Authors
Hengjie Cao, Zhendong Huang, Mengyi Chen, Yifeng Yang, Fanqi Yu, Ruijun Huang, Fang Dong, Xin Zhang, Jixian Zhou, Anrui Chen, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Qin Lv, Yuan Cheng, Tun Lu, Fan Yang, Li Shang
Abstract
Large language models trained on natural language exhibit pronounced anisotropy: a small number of directions concentrate disproportionate energy, while the remaining dimensions form a broad semantic tail. In low-bit training regimes, this geometry becomes numerically unstable. Because blockwise quantization scales are determined by extreme elementwise magnitudes, dominant directions stretch the dynamic range, compressing long-tail semantic variation into narrow numerical bins. We show that this instability is primarily driven by a coherent rank-one mean bias, which constitutes the dominant component of spectral anisotropy in LLM representations. This mean component emerges systematically across layers and training stages and accounts for the majority of extreme activation magnitudes, making it the principal driver of dynamic-range inflation under low precision. Crucially, because the dominant instability is rank-one, it can be eliminated through a simple source-level mean-subtraction operation. This bias-centric conditioning recovers most of the stability benefits of SVD-based spectral methods while requiring only reduction operations and standard quantization kernels. Empirical results on FP4 (W4A4G4) training show that mean removal substantially narrows the loss gap to BF16 and restores downstream performance, providing a hardware-efficient path to stable low-bit LLM training.
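The mechanism the abstract describes (a coherent mean bias setting blockwise quantization scales, and source-level mean subtraction restoring resolution for the long-tail variation) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the E2M1-style FP4 grid, block size, and synthetic data below are assumptions chosen only to make the dynamic-range effect visible.

```python
# Illustrative sketch only (not the paper's code): shows how a rank-one mean
# component inflates per-block quantization scales, and how subtracting it
# before quantization preserves the small "semantic tail" variation.
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes (assumed grid)

def quantize_blockwise_fp4(x, block=32):
    """Fake-quantize a flat array blockwise; scale = max|x| per block / max grid value."""
    xb = x.reshape(-1, block)
    scale = np.abs(xb).max(axis=1, keepdims=True) / FP4_GRID.max()
    scale = np.where(scale == 0.0, 1.0, scale)
    mags = np.abs(xb) / scale
    idx = np.abs(mags[..., None] - FP4_GRID).argmin(axis=-1)  # nearest FP4 magnitude
    return (np.sign(xb) * FP4_GRID[idx] * scale).reshape(-1)

rng = np.random.default_rng(0)
n, d = 256, 128
tail = 0.05 * rng.standard_normal((n, d))   # broad, low-energy "semantic tail"
bias = 3.0 * np.ones((1, d))                # coherent rank-one mean component
x = tail + bias

# Curse: the mean component sets every block's scale, so the tail variation
# is compressed into a few coarse bins.
err_naive = np.abs(quantize_blockwise_fp4(x.ravel()).reshape(n, d) - x).mean()

# Blessing: subtract the (rank-one) mean before quantizing, add it back after.
mu = x.mean(axis=0, keepdims=True)
x_hat = quantize_blockwise_fp4((x - mu).ravel()).reshape(n, d) + mu
err_centered = np.abs(x_hat - x).mean()

print(f"mean abs quantization error, naive:           {err_naive:.4f}")
print(f"mean abs quantization error, mean-subtracted:  {err_centered:.4f}")
```

On this toy data the mean-subtracted path yields a much smaller reconstruction error, mirroring the abstract's argument: the subtracted mean is carried separately and re-added, so the fix needs only reduction operations alongside standard quantization kernels.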
Metadata
arXiv: 2603.10444v1 (cs.LG, cs.AI)
Published: 2026-03-11
PDF: https://arxiv.org/pdf/2603.10444v1
Related papers
Gen-Searcher: Reinforcing Agentic Search for Image Generation
Kaituo Feng, Manyuan Zhang, Shuang Chen, Yunlong Lin, Kaixuan Fan, Yilei Jian... • 2026-03-30
On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers
Omer Dahary, Benaya Koren, Daniel Garibi, Daniel Cohen-Or • 2026-03-30
Graphilosophy: Graph-Based Digital Humanities Computing with The Four Books
Minh-Thu Do, Quynh-Chau Le-Tran, Duc-Duy Nguyen-Mai, Thien-Trang Nguyen, Khan... • 2026-03-30
ParaSpeechCLAP: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining
Anuj Diwan, Eunsol Choi, David Harwath • 2026-03-30
RAD-AI: Rethinking Architecture Documentation for AI-Augmented Ecosystems
Oliver Aleksander Larsen, Mahyar T. Moghaddam • 2026-03-30