Paper
Examining LLMs Ability to Summarize Code Through Mutation-Analysis
Authors
Lara Khatib, Micheal Pu, Bogdan Vasilescu, Meiyappan Nagappan
Abstract
As developers increasingly rely on LLM-generated code summaries for documentation, testing, and review, it is important to study whether these summaries accurately reflect what the program actually does. LLMs often produce confident descriptions of what the code looks like it should do (intent), while missing subtle edge cases or logic changes that define what it actually does (behavior). We present a mutation-based evaluation methodology that directly tests whether a summary truly matches the code's logic. Our approach generates a summary, injects a targeted mutation into the code, and checks if the LLM updates its summary to reflect the new behavior. We validate it through three experiments totalling 624 mutation-summary evaluations across 62 programs. First, on 12 controlled synthetic programs with 324 mutations varying in type (statement, value, decision) and location (beginning, middle, end). We find that summary accuracy decreases sharply with complexity from 76.5% for single functions to 17.3% for multi-threaded systems, while mutation type and location exhibit weaker effects. Second, testing 150 mutated samples on 50 human-written programs from the Less Basic Python Problems (LBPP) dataset confirms the same failure patterns persist as models often describe algorithmic intent rather than actual mutated behavior with a summary accuracy rate of 49.3%. Furthermore, while a comparison between GPT-4 and GPT-5.2 shows a substantial performance leap (from 49.3% to 85.3%) and an improved ability to identify mutations as "bugs", both models continue to struggle with distinguishing implementation details from standard algorithmic patterns. This work establishes mutation analysis as a systematic approach for assessing whether LLM-generated summaries reflect program behavior rather than superficial textual patterns.
Metadata
Related papers
Vibe Coding XR: Accelerating AI + XR Prototyping with XR Blocks and Gemini
Ruofei Du, Benjamin Hersh, David Li, Nels Numan, Xun Qian, Yanhe Chen, Zhongy... • 2026-03-25
Comparing Developer and LLM Biases in Code Evaluation
Aditya Mittal, Ryan Shar, Zichu Wu, Shyam Agarwal, Tongshuang Wu, Chris Donah... • 2026-03-25
The Stochastic Gap: A Markovian Framework for Pre-Deployment Reliability and Oversight-Cost Auditing in Agentic Artificial Intelligence
Biplab Pal, Santanu Bhattacharya • 2026-03-25
Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA
Saahil Mathur, Ryan David Rittner, Vedant Ajit Thakur, Daniel Stuart Schiff, ... • 2026-03-25
MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination
Zhuo Li, Yupeng Zhang, Pengyu Cheng, Jiajun Song, Mengyu Zhou, Hao Li, Shujie... • 2026-03-25
Raw Data (Debug)
{
"raw_xml": "<entry>\n <id>http://arxiv.org/abs/2602.17838v1</id>\n <title>Examining LLMs Ability to Summarize Code Through Mutation-Analysis</title>\n <updated>2026-02-19T21:00:49Z</updated>\n <link href='https://arxiv.org/abs/2602.17838v1' rel='alternate' type='text/html'/>\n <link href='https://arxiv.org/pdf/2602.17838v1' rel='related' title='pdf' type='application/pdf'/>\n <summary>As developers increasingly rely on LLM-generated code summaries for documentation, testing, and review, it is important to study whether these summaries accurately reflect what the program actually does. LLMs often produce confident descriptions of what the code looks like it should do (intent), while missing subtle edge cases or logic changes that define what it actually does (behavior). We present a mutation-based evaluation methodology that directly tests whether a summary truly matches the code's logic. Our approach generates a summary, injects a targeted mutation into the code, and checks if the LLM updates its summary to reflect the new behavior. We validate it through three experiments totalling 624 mutation-summary evaluations across 62 programs. First, on 12 controlled synthetic programs with 324 mutations varying in type (statement, value, decision) and location (beginning, middle, end). We find that summary accuracy decreases sharply with complexity from 76.5% for single functions to 17.3% for multi-threaded systems, while mutation type and location exhibit weaker effects. Second, testing 150 mutated samples on 50 human-written programs from the Less Basic Python Problems (LBPP) dataset confirms the same failure patterns persist as models often describe algorithmic intent rather than actual mutated behavior with a summary accuracy rate of 49.3%. Furthermore, while a comparison between GPT-4 and GPT-5.2 shows a substantial performance leap (from 49.3% to 85.3%) and an improved ability to identify mutations as \"bugs\", both models continue to struggle with distinguishing implementation details from standard algorithmic patterns. This work establishes mutation analysis as a systematic approach for assessing whether LLM-generated summaries reflect program behavior rather than superficial textual patterns.</summary>\n <category scheme='http://arxiv.org/schemas/atom' term='cs.SE'/>\n <published>2026-02-19T21:00:49Z</published>\n <arxiv:primary_category term='cs.SE'/>\n <author>\n <name>Lara Khatib</name>\n </author>\n <author>\n <name>Micheal Pu</name>\n </author>\n <author>\n <name>Bogdan Vasilescu</name>\n </author>\n <author>\n <name>Meiyappan Nagappan</name>\n </author>\n </entry>"
}