Paper
Flash-KMeans: Fast and Memory-Efficient Exact K-Means
Authors
Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Xiaoze Fan, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Kurt Keutzer, Song Han, Chenfeng Xu, Ion Stoica
Abstract
$k$-means has historically been positioned primarily as an offline processing primitive, typically used for dataset organization or embedding preprocessing rather than as a first-class component in online systems. In this work, we revisit this classical algorithm under the lens of modern AI system design and enable $k$-means as an online primitive. We point out that existing GPU implementations of $k$-means remain fundamentally bottlenecked by low-level system constraints rather than theoretical algorithmic complexity. Specifically, the assignment stage suffers from a severe IO bottleneck due to the massive explicit materialization of the $N \times K$ distance matrix in High Bandwidth Memory (HBM). Simultaneously, the centroid update stage is heavily penalized by hardware-level atomic write contention caused by irregular, scatter-style token aggregations. To bridge this performance gap, we propose flash-kmeans, an IO-aware and contention-free $k$-means implementation for modern GPU workloads. Flash-kmeans introduces two core kernel-level innovations: (1) FlashAssign, which fuses distance computation with an online argmin to completely bypass intermediate memory materialization; (2) sort-inverse update, which explicitly constructs an inverse mapping to transform high-contention atomic scatters into high-bandwidth, segment-level localized reductions. Furthermore, we integrate algorithm-system co-designs, including chunked-stream overlap and cache-aware compile heuristics, to ensure practical deployability. Extensive evaluations on NVIDIA H200 GPUs demonstrate that flash-kmeans achieves up to 17.9$\times$ end-to-end speedup over best baselines, while outperforming industry-standard libraries like cuML and FAISS by 33$\times$ and over 200$\times$, respectively.
Metadata
Related papers
Gen-Searcher: Reinforcing Agentic Search for Image Generation
Kaituo Feng, Manyuan Zhang, Shuang Chen, Yunlong Lin, Kaixuan Fan, Yilei Jian... • 2026-03-30
On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers
Omer Dahary, Benaya Koren, Daniel Garibi, Daniel Cohen-Or • 2026-03-30
Graphilosophy: Graph-Based Digital Humanities Computing with The Four Books
Minh-Thu Do, Quynh-Chau Le-Tran, Duc-Duy Nguyen-Mai, Thien-Trang Nguyen, Khan... • 2026-03-30
ParaSpeechCLAP: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining
Anuj Diwan, Eunsol Choi, David Harwath • 2026-03-30
RAD-AI: Rethinking Architecture Documentation for AI-Augmented Ecosystems
Oliver Aleksander Larsen, Mahyar T. Moghaddam • 2026-03-30
Raw Data (Debug)
{
"raw_xml": "<entry>\n <id>http://arxiv.org/abs/2603.09229v1</id>\n <title>Flash-KMeans: Fast and Memory-Efficient Exact K-Means</title>\n <updated>2026-03-10T05:54:52Z</updated>\n <link href='https://arxiv.org/abs/2603.09229v1' rel='alternate' type='text/html'/>\n <link href='https://arxiv.org/pdf/2603.09229v1' rel='related' title='pdf' type='application/pdf'/>\n <summary>$k$-means has historically been positioned primarily as an offline processing primitive, typically used for dataset organization or embedding preprocessing rather than as a first-class component in online systems. In this work, we revisit this classical algorithm under the lens of modern AI system design and enable $k$-means as an online primitive. We point out that existing GPU implementations of $k$-means remain fundamentally bottlenecked by low-level system constraints rather than theoretical algorithmic complexity. Specifically, the assignment stage suffers from a severe IO bottleneck due to the massive explicit materialization of the $N \\times K$ distance matrix in High Bandwidth Memory (HBM). Simultaneously, the centroid update stage is heavily penalized by hardware-level atomic write contention caused by irregular, scatter-style token aggregations. To bridge this performance gap, we propose flash-kmeans, an IO-aware and contention-free $k$-means implementation for modern GPU workloads. Flash-kmeans introduces two core kernel-level innovations: (1) FlashAssign, which fuses distance computation with an online argmin to completely bypass intermediate memory materialization; (2) sort-inverse update, which explicitly constructs an inverse mapping to transform high-contention atomic scatters into high-bandwidth, segment-level localized reductions. Furthermore, we integrate algorithm-system co-designs, including chunked-stream overlap and cache-aware compile heuristics, to ensure practical deployability. Extensive evaluations on NVIDIA H200 GPUs demonstrate that flash-kmeans achieves up to 17.9$\\times$ end-to-end speedup over best baselines, while outperforming industry-standard libraries like cuML and FAISS by 33$\\times$ and over 200$\\times$, respectively.</summary>\n <category scheme='http://arxiv.org/schemas/atom' term='cs.DC'/>\n <published>2026-03-10T05:54:52Z</published>\n <arxiv:primary_category term='cs.DC'/>\n <author>\n <name>Shuo Yang</name>\n </author>\n <author>\n <name>Haocheng Xi</name>\n </author>\n <author>\n <name>Yilong Zhao</name>\n </author>\n <author>\n <name>Muyang Li</name>\n </author>\n <author>\n <name>Xiaoze Fan</name>\n </author>\n <author>\n <name>Jintao Zhang</name>\n </author>\n <author>\n <name>Han Cai</name>\n </author>\n <author>\n <name>Yujun Lin</name>\n </author>\n <author>\n <name>Xiuyu Li</name>\n </author>\n <author>\n <name>Kurt Keutzer</name>\n </author>\n <author>\n <name>Song Han</name>\n </author>\n <author>\n <name>Chenfeng Xu</name>\n </author>\n <author>\n <name>Ion Stoica</name>\n </author>\n </entry>"
}