Paper
Classroom Final Exam: An Instructor-Tested Reasoning Benchmark
Authors
Chongyang Gao, Diji Yang, Shuyan Zhou, Xichen Yan, Luchuan Song, Shuo Li, Kezhen Chen
Abstract
We introduce CFE (Classroom Final Exam), a multimodal benchmark for evaluating the reasoning capabilities of large language models across more than 20 STEM domains. CFE is curated from repeatedly used, authentic university homework and exam problems, together with reference solutions provided by course instructors. CFE presents a significant challenge even for frontier models: the newly released Gemini-3.1-pro-preview achieves an overall accuracy of 59.69%, while the second-best model, Gemini-3-flash-preview, reaches 55.46%, leaving considerable room for improvement. Beyond leaderboard results, we perform a diagnostic analysis by decomposing reference solutions into reasoning flows. We find that although frontier models can often answer intermediate sub-questions correctly, they struggle to reliably derive and maintain correct intermediate states throughout multi-step solutions. We further observe that model-generated solutions typically have more reasoning steps than those provided by the instructor, indicating suboptimal step efficiency and a higher risk of error accumulation. The data and code are available at https://github.com/Analogy-AI/CFE_Bench.
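The abstract's point about error accumulation can be illustrated with a small sketch. It assumes, purely for illustration, that every reasoning step is correct independently and with the same probability; this simplification, the function name `chain_success_probability`, and the example numbers are hypothetical and do not come from the paper.

```python
def chain_success_probability(per_step_accuracy: float, num_steps: int) -> float:
    """Probability that every step of a solution is correct.

    Illustrative assumption (not from the paper): steps succeed
    independently, each with probability `per_step_accuracy`.
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be in [0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return per_step_accuracy ** num_steps


if __name__ == "__main__":
    # Hypothetical per-step accuracy of 0.95; compare shorter and longer chains.
    for steps in (5, 10, 20):
        probability = chain_success_probability(0.95, steps)
        print(f"{steps:>2} steps -> {probability:.3f}")
```

Under this toy model, a solution that takes twice as many steps at the same per-step accuracy has its chance of being fully correct squared, which is one way to read why longer model-generated solutions may carry more risk than the shorter instructor solutions.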
Metadata
arXiv ID: 2602.19517v1
Abstract page: https://arxiv.org/abs/2602.19517v1
PDF: https://arxiv.org/pdf/2602.19517v1
Published: 2026-02-23T05:17:41Z
Updated: 2026-02-23T05:17:41Z
Primary category: cs.AI
Categories: cs.AI, cs.CE, cs.CL, cs.CV