
SorryDB: Can AI Provers Complete Real-World Lean Theorems?

Authors

Austin Letson, Leopoldo Sarra, Auguste Poiroux, Oliver Dressler, Paul Lezeau, Dhyan Aranha, Frederick Pu, Aaron Hill, Miguel Corredera Hidalgo, Julian Berman, George Tsoukalas, Lenny Taelman

Abstract

We present SorryDB, a dynamically updating benchmark of open Lean tasks drawn from 78 real-world formalization projects on GitHub. Unlike existing static benchmarks, which are often composed of competition problems, SorryDB is designed so that hill-climbing on it yields tools that are aligned with the community's needs, more usable by mathematicians, and more capable of understanding complex dependencies. Moreover, by providing a continuously updated stream of tasks, SorryDB mitigates test-set contamination and offers a robust metric for an agent's ability to contribute to novel formal mathematics projects. We evaluate a collection of approaches, including generalist large language models, agentic approaches, and specialized symbolic provers, on a selected snapshot of 1000 tasks from SorryDB. We show that current approaches are complementary: although an agentic approach based on Gemini Flash is the most performant, it is not strictly better than other off-the-shelf large language models, specialized provers, or even a curated list of Lean tactics.
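
For context, a "sorry" in Lean is a placeholder that closes a goal without proving it; the tasks in SorryDB are such open goals harvested from real projects. The sketch below is purely illustrative (the theorem name and statement are hypothetical, not taken from the paper or the database): an AI prover is asked to replace the `sorry` with an actual proof.

    -- A hypothetical unproven statement as it might appear in a formalization project.
    -- The task for a prover is to replace `sorry` with a complete proof.
    theorem add_comm_example (a b : Nat) : a + b = b + a := by
      sorry

    -- One possible completion an agent could submit, using a core lemma:
    theorem add_comm_example' (a b : Nat) : a + b = b + a := by
      exact Nat.add_comm a b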

Metadata

arXiv ID: 2603.02668
Provider: ARXIV
Primary Category: cs.AI
Published: 2026-03-03
Fetched: 2026-03-04 03:41
