AI LLM February 25, 2026

IOAgent: Democratizing Trustworthy HPC I/O Performance Diagnosis Capability via LLMs

Authors

Chris Egersdoerfer, Arnav Sareen, Jean Luca Bez, Suren Byna, Dongkuan Xu, Dong Dai

Abstract

As the complexity of the HPC storage stack rapidly grows, domain scientists face increasing challenges in effectively utilizing HPC storage systems to achieve their desired I/O performance. To identify and address I/O issues, scientists largely rely on I/O experts to analyze their I/O traces and provide insights into potential problems. However, with a limited number of I/O experts and the growing demand for data-intensive applications, inaccessibility has become a major bottleneck, hindering scientists from maximizing their productivity. Rapid advances in LLMs make it possible to build an automated tool that brings trustworthy I/O performance diagnosis to domain scientists. However, key challenges remain, such as the inability to handle long context windows, a lack of accurate domain knowledge about HPC I/O, and the generation of hallucinations during complex interactions. In this work, we propose IOAgent as a systematic effort to address these challenges. IOAgent integrates a module-based pre-processor, a RAG-based domain knowledge integrator, and a tree-based merger to accurately diagnose I/O issues from a given Darshan trace file. Similar to an I/O expert, IOAgent provides detailed justifications and references for its diagnoses and offers an interactive interface for scientists to ask targeted follow-up questions. To evaluate IOAgent, we collected a diverse set of labeled job traces and released the first open diagnosis test suite, TraceBench. Using this test suite, we conducted extensive evaluations, demonstrating that IOAgent matches or outperforms state-of-the-art I/O diagnosis tools with accurate and useful diagnosis results. We also show that IOAgent is not tied to specific LLMs, performing similarly well with both proprietary and open-source LLMs. We believe IOAgent has the potential to become a powerful tool for scientists navigating complex HPC I/O subsystems in the future.

Metadata

arXiv ID: 2602.22017
Provider: ARXIV
Primary Category: cs.DC
Published: 2026-02-25
Fetched: 2026-02-26 05:00
Comment: Published in the Proceedings of the 2025 IEEE International Parallel and Distributed Processing Symposium (IPDPS 2025)
DOI: 10.1109/IPDPS64566.2025.00036
Abstract page: https://arxiv.org/abs/2602.22017v1
PDF: https://arxiv.org/pdf/2602.22017v1