Paper

AI · LLM · February 19, 2026

Robustness and Reasoning Fidelity of Large Language Models in Long-Context Code Question Answering

Authors

Kishan Maharaj, Nandakishore Menon, Ashita Saxena, Srikanth Tamilselvam

Abstract

Large language models (LLMs) increasingly assist software engineering tasks that require reasoning over long code contexts, yet their robustness under varying input conditions remains unclear. We conduct a systematic study of long-context code question answering using controlled ablations that test sensitivity to answer format, distractors, and context scale. Extending the LongCodeBench Python dataset with new COBOL and Java question-answer sets, we evaluate state-of-the-art models under three settings: (i) shuffled multiple-choice options, (ii) open-ended questions, and (iii) needle-in-a-haystack contexts containing relevant and adversarially irrelevant information. Results show substantial performance drops under both the shuffled multiple-choice and open-ended settings, and brittle behavior in the presence of irrelevant cues. Our findings highlight limitations of current long-context evaluations and provide a broader benchmark for assessing code reasoning in both legacy and modern systems.
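
For concreteness, the two input manipulations described in the abstract can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' evaluation harness: the function names, the A/B/C/D labeling scheme, and the depth-based insertion point are assumptions.

import random

LABELS = "ABCD"

def shuffle_mcq(question: str, options: list[str], answer_idx: int, seed: int = 0) -> tuple[str, str]:
    # Permute the answer options while tracking where the gold answer lands,
    # so the same item can be scored under every permutation.
    # (Hypothetical helper; not from the paper.)
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    body = "\n".join(f"{LABELS[i]}. {options[j]}" for i, j in enumerate(order))
    return f"{question}\n{body}", LABELS[order.index(answer_idx)]

def build_haystack(needle: str, filler_files: list[str], depth: float) -> str:
    # Embed the question-relevant snippet (the "needle") at a relative depth
    # inside concatenated irrelevant code, so that context scale and distractor
    # placement can be varied independently.
    haystack = "\n\n".join(filler_files)
    pos = haystack.rfind("\n", 0, int(len(haystack) * depth)) + 1
    return haystack[:pos] + needle + "\n" + haystack[pos:]

Scoring then compares the model's chosen label against the relabeled gold answer across several seeds: a format-robust model's accuracy should be invariant to the permutation, and a context-robust model's accuracy should be invariant to the needle's depth.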

Metadata

arXiv ID: 2602.17183
Provider: arXiv
Primary Category: cs.SE
Secondary Category: cs.AI
Comment: 11 pages, 4 figures, 5 tables; work in progress
Published: 2026-02-19
Fetched: 2026-02-21 18:51
arXiv page: https://arxiv.org/abs/2602.17183v1
PDF: https://arxiv.org/pdf/2602.17183v1
