Paper
On the Challenges and Opportunities of Learned Sparse Retrieval for Code
Authors
Simon Lupart, Maxime Louis, Thibault Formal, Hervé Déjean, Stéphane Clinchant
Abstract
Retrieval over large codebases is a key component of modern LLM-based software engineering systems. Existing approaches predominantly rely on dense embedding models, while learned sparse retrieval (LSR) remains largely unexplored for code. However, applying sparse retrieval to code is challenging due to subword fragmentation, semantic gaps between natural-language queries and code, diversity of programming languages and sub-tasks, and the length of code documents, which can harm sparsity and latency. We introduce SPLADE-Code, the first large-scale family of learned sparse retrieval models specialized for code retrieval (600M-8B parameters). Despite a lightweight one-stage training pipeline, SPLADE-Code achieves state-of-the-art performance among retrievers under 1B parameters (75.4 on MTEB Code) and competitive results at larger scales (79.0 with 8B). We show that learned expansion tokens are critical to bridge lexical and semantic matching, and provide a latency analysis showing that LSR enables sub-millisecond retrieval on a 1M-passage collection with little effectiveness loss.
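Illustrative sketch (not from the paper): SPLADE-style learned sparse retrieval encodes a text into a weighted bag of vocabulary terms through a masked-language-model head, so query-document scoring reduces to a sparse dot product that an inverted index can serve, which is what makes the sub-millisecond latency claimed in the abstract plausible. The snippet below assumes the standard SPLADE pooling w_j = max_i log(1 + ReLU(logit_ij)); the checkpoint name is a public general-domain SPLADE model used as a stand-in, since no SPLADE-Code weights are referenced on this page.

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Stand-in checkpoint: a public general-domain SPLADE model, NOT SPLADE-Code.
model_id = "naver/splade-cocondenser-ensembledistil"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
model.eval()

def encode(text: str) -> torch.Tensor:
    """Text -> sparse |V|-dim vector: w_j = max_i log(1 + ReLU(logit_ij))."""
    batch = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**batch).logits           # (1, seq_len, vocab_size)
    # Zero out padding positions, then max-pool token-level term importances.
    weights = torch.log1p(torch.relu(logits)) * batch["attention_mask"].unsqueeze(-1)
    return weights.max(dim=1).values.squeeze(0)  # (vocab_size,)

query = encode("how to parse a json file in python")
doc = encode("def load_config(path):\n    import json\n    return json.load(open(path))")
print("score:", torch.dot(query, doc).item())    # sparse dot product

# Learned expansion tokens: vocabulary entries with nonzero weight that never
# occur in the surface text; the abstract credits these with bridging lexical
# and semantic matching between NL queries and code.
nonzero_ids = (doc > 0).nonzero().squeeze(-1).tolist()
print(tokenizer.convert_ids_to_tokens(nonzero_ids)[:20])

At serving time, each document's nonzero (term, weight) pairs go into an inverted index, so retrieval only touches postings for the query's nonzero terms rather than scoring every document.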
Metadata
arXiv:2603.22008 • cs.IR, cs.CL • Published 2026-03-23 • 15 pages, 5 figures, 12 tables
Related papers
Vibe Coding XR: Accelerating AI + XR Prototyping with XR Blocks and Gemini
Ruofei Du, Benjamin Hersh, David Li, Nels Numan, Xun Qian, Yanhe Chen, Zhongy... • 2026-03-25
Comparing Developer and LLM Biases in Code Evaluation
Aditya Mittal, Ryan Shar, Zichu Wu, Shyam Agarwal, Tongshuang Wu, Chris Donah... • 2026-03-25
The Stochastic Gap: A Markovian Framework for Pre-Deployment Reliability and Oversight-Cost Auditing in Agentic Artificial Intelligence
Biplab Pal, Santanu Bhattacharya • 2026-03-25
Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA
Saahil Mathur, Ryan David Rittner, Vedant Ajit Thakur, Daniel Stuart Schiff, ... • 2026-03-25
MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination
Zhuo Li, Yupeng Zhang, Pengyu Cheng, Jiajun Song, Mengyu Zhou, Hao Li, Shujie... • 2026-03-25