AI LLM March 17, 2026

TRACE: Evaluating Execution Efficiency of LLM-Based Code Translation

Authors

Zhihao Gong, Zeyu Sun, Dong Huang, Qingyuan Liang, Jie M. Zhang, Dan Hao

Abstract

While Large Language Models (LLMs) have substantially improved the functional correctness of code translation, the critical dimension of execution efficiency remains overlooked. We present TRACE, the first benchmark to explicitly assess efficiency in LLM-translated code. TRACE includes 1,000 efficiency-critical tasks across C++, Java, and Python, each augmented with stress tests that reveal efficiency degradations often overlooked by small-scale tests. Using TRACE, we conduct an extensive evaluation of 28 representative LLMs and highlight several key insights: 1) Correctness is not a reliable proxy for efficiency: the correctness leader Claude-4-think achieves only mid-level time efficiency, outperformed by smaller open-source LLMs such as Qwen2.5-Coder-14B-Instruct. 2) Inefficiency is both prevalent and patterned: 23.5% of correct translations exhibit pronounced inefficiency, distributed across algorithmic faults (11.9%), language construct mismatches (66.4%), and resource mismanagement (21.7%). 3) Inference-time prompt strategies bring only modest improvements, suggesting that current LLMs lack intrinsic efficiency awareness. Together, our results establish efficiency as an essential dimension of code translation and position TRACE as a principled foundation for efficiency-oriented evaluation.
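As a rough illustration of why a stress test can expose inefficiency that a small functional test misses, the sketch below compares two functionally equivalent Python routines. It is not taken from the paper or the benchmark: the function names, the deduplication task, and the input sizes are all assumptions chosen for illustration. The first routine mimics the kind of language construct mismatch the abstract mentions (a hash-set lookup rendered as a linear scan over a list), while the second uses the idiomatic container.

```python
import time


def dedupe_with_list(values):
    """Order-preserving dedupe that tracks seen items in a list.

    Hypothetical example of a construct mismatch: `x not in seen` scans
    the list, so each lookup is O(n) and the whole routine is O(n^2).
    """
    seen = []
    out = []
    for x in values:
        if x not in seen:
            seen.append(x)
            out.append(x)
    return out


def dedupe_with_set(values):
    """Same behavior, but membership checks use a set (O(1) on average)."""
    seen = set()
    out = []
    for x in values:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out


def time_call(fn, arg):
    """Run fn(arg) once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(arg)
    return result, time.perf_counter() - start


# Small functional test: both versions pass and look equally fast.
small = [3, 1, 3, 2, 1]
assert dedupe_with_list(small) == dedupe_with_set(small)

# Stress input (size is an arbitrary assumption): all values distinct,
# which is the worst case for the list scan.
stress = list(range(3000))
slow_result, slow_time = time_call(dedupe_with_list, stress)
fast_result, fast_time = time_call(dedupe_with_set, stress)
print(f"list-based: {slow_time:.4f}s, set-based: {fast_time:.4f}s")
```

Both versions would count as correct translations under a functional test, and only the larger input makes the quadratic behavior visible. Absolute timings depend on the machine, so a real harness would likely repeat measurements and compare against a reference solution rather than rely on a single run.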

Metadata

arXiv ID: 2603.16479

Provider: ARXIV

Primary Category: cs.SE

Published: 2026-03-17

Fetched: 2026-03-18 06:02

