Paper
RetailBench: Evaluating Long-Horizon Autonomous Decision-Making and Strategy Stability of LLM Agents in Realistic Retail Environments
Authors
Linghua Zhang, Jun Wang, Jingtong Wu, Zhisong Zhang
Abstract
Large Language Model (LLM)-based agents have achieved notable success on short-horizon and highly structured tasks. However, their ability to maintain coherent decision-making over long horizons in realistic and dynamic environments remains an open challenge. We introduce RetailBench, a high-fidelity benchmark designed to evaluate long-horizon autonomous decision-making in realistic commercial scenarios, where agents must operate under stochastic demand and evolving external conditions. We further propose the Evolving Strategy & Execution framework, which separates high-level strategic reasoning from low-level action execution. This design enables adaptive and interpretable strategy evolution over time. It is particularly important for long-horizon tasks, where non-stationary environments and error accumulation require strategies to be revised at a different temporal scale than action execution. Experiments on eight state-of-the-art LLMs across progressively challenging environments show that our framework improves operational stability and efficiency compared to other baselines. However, performance degrades substantially as task complexity increases, revealing fundamental limitations in current LLMs for long-horizon, multi-factor decision-making.
Metadata
Related papers
Vibe Coding XR: Accelerating AI + XR Prototyping with XR Blocks and Gemini
Ruofei Du, Benjamin Hersh, David Li, Nels Numan, Xun Qian, Yanhe Chen, Zhongy... • 2026-03-25
Comparing Developer and LLM Biases in Code Evaluation
Aditya Mittal, Ryan Shar, Zichu Wu, Shyam Agarwal, Tongshuang Wu, Chris Donah... • 2026-03-25
The Stochastic Gap: A Markovian Framework for Pre-Deployment Reliability and Oversight-Cost Auditing in Agentic Artificial Intelligence
Biplab Pal, Santanu Bhattacharya • 2026-03-25
Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA
Saahil Mathur, Ryan David Rittner, Vedant Ajit Thakur, Daniel Stuart Schiff, ... • 2026-03-25
MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination
Zhuo Li, Yupeng Zhang, Pengyu Cheng, Jiajun Song, Mengyu Zhou, Hao Li, Shujie... • 2026-03-25
Raw Data (Debug)
{
"raw_xml": "<entry>\n <id>http://arxiv.org/abs/2603.16453v1</id>\n <title>RetailBench: Evaluating Long-Horizon Autonomous Decision-Making and Strategy Stability of LLM Agents in Realistic Retail Environments</title>\n <updated>2026-03-17T12:35:52Z</updated>\n <link href='https://arxiv.org/abs/2603.16453v1' rel='alternate' type='text/html'/>\n <link href='https://arxiv.org/pdf/2603.16453v1' rel='related' title='pdf' type='application/pdf'/>\n <summary>Large Language Model (LLM)-based agents have achieved notable success on short-horizon and highly structured tasks. However, their ability to maintain coherent decision-making over long horizons in realistic and dynamic environments remains an open challenge.\n We introduce RetailBench, a high-fidelity benchmark designed to evaluate long-horizon autonomous decision-making in realistic commercial scenarios, where agents must operate under stochastic demand and evolving external conditions.\n We further propose the Evolving Strategy & Execution framework, which separates high-level strategic reasoning from low-level action execution. This design enables adaptive and interpretable strategy evolution over time. It is particularly important for long-horizon tasks, where non-stationary environments and error accumulation require strategies to be revised at a different temporal scale than action execution.\n Experiments on eight state-of-the-art LLMs across progressively challenging environments show that our framework improves operational stability and efficiency compared to other baselines. However, performance degrades substantially as task complexity increases, revealing fundamental limitations in current LLMs for long-horizon, multi-factor decision-making.</summary>\n <category scheme='http://arxiv.org/schemas/atom' term='cs.AI'/>\n <published>2026-03-17T12:35:52Z</published>\n <arxiv:primary_category term='cs.AI'/>\n <author>\n <name>Linghua Zhang</name>\n </author>\n <author>\n <name>Jun Wang</name>\n </author>\n <author>\n <name>Jingtong Wu</name>\n </author>\n <author>\n <name>Zhisong Zhang</name>\n </author>\n </entry>"
}