Classroom Final Exam: An Instructor-Tested Reasoning Benchmark

Authors

Chongyang Gao, Diji Yang, Shuyan Zhou, Xichen Yan, Luchuan Song, Shuo Li, Kezhen Chen

Abstract

We introduce CFE (Classroom Final Exam), a multimodal benchmark for evaluating the reasoning capabilities of large language models across more than 20 STEM domains. CFE is curated from repeatedly used, authentic university homework and exam problems, together with reference solutions provided by course instructors. CFE presents a significant challenge even for frontier models: the newly released Gemini-3.1-pro-preview achieves an overall accuracy of 59.69%, while the second-best model, Gemini-3-flash-preview, reaches 55.46%, leaving considerable room for improvement. Beyond leaderboard results, we perform a diagnostic analysis by decomposing reference solutions into reasoning flows. We find that although frontier models can often answer intermediate sub-questions correctly, they struggle to reliably derive and maintain correct intermediate states throughout multi-step solutions. We further observe that model-generated solutions typically have more reasoning steps than those provided by the instructor, indicating suboptimal step efficiency and a higher risk of error accumulation. The data and code are available at https://github.com/Analogy-AI/CFE_Bench.
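
The step-efficiency observation lends itself to a simple quantitative check: compare the number of reasoning steps in a model-generated solution against the instructor's reference solution, alongside per-domain accuracy. The Python sketch below shows one way such a summary could be computed over graded examples; the record fields (domain, correct, model_steps, reference_steps) are illustrative assumptions for demonstration, not the actual schema of the CFE_Bench repository.

    # Illustrative sketch: per-domain accuracy and step-efficiency ratio.
    # Field names are assumptions for demonstration, not the CFE_Bench schema.
    from collections import defaultdict
    from statistics import mean

    def summarize(records):
        """records: iterable of dicts with keys 'domain', 'correct' (bool),
        'model_steps' (int), and 'reference_steps' (int)."""
        by_domain = defaultdict(list)
        for r in records:
            by_domain[r["domain"]].append(r)

        summary = {}
        for domain, rs in by_domain.items():
            accuracy = mean(1.0 if r["correct"] else 0.0 for r in rs)
            # Ratio > 1 means the model used more steps than the instructor's
            # solution, the pattern the paper describes as suboptimal step efficiency.
            step_ratio = mean(r["model_steps"] / r["reference_steps"] for r in rs)
            summary[domain] = {"accuracy": accuracy, "step_ratio": step_ratio}
        return summary

    if __name__ == "__main__":
        demo = [
            {"domain": "physics", "correct": True, "model_steps": 9, "reference_steps": 6},
            {"domain": "physics", "correct": False, "model_steps": 12, "reference_steps": 7},
            {"domain": "calculus", "correct": True, "model_steps": 5, "reference_steps": 5},
        ]
        print(summarize(demo))

A step ratio well above 1 in a domain, combined with low accuracy, would be consistent with the error-accumulation risk the abstract points to.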

Metadata

arXiv ID: 2602.19517
Provider: ARXIV
Primary Category: cs.AI
Categories: cs.AI, cs.CE, cs.CL, cs.CV
Published: 2026-02-23
Fetched: 2026-02-24 04:38
