IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs
Authors
Chuan Guo, Juan Felipe Ceron Uribe, Sicheng Zhu, Christopher A. Choquette-Choo, Steph Lin, Nikhil Kandpal, Milad Nasr, Rai, Sam Toyer, Miles Wang, Yaodong Yu, Alex Beutel, Kai Xiao
Abstract
Instruction hierarchy (IH) defines how LLMs prioritize system, developer, user, and tool instructions under conflict, providing a concrete, trust-ordered policy for resolution. IH is key to defending against jailbreaks, system prompt extractions, and agentic prompt injections. However, robust IH behavior is difficult to train: IH failures can be confounded with instruction-following failures, conflicts can be nuanced, and models can learn shortcuts such as overrefusing. We introduce IH-Challenge, a reinforcement learning training dataset, to address these difficulties. Fine-tuning GPT-5-Mini on IH-Challenge with online adversarial example generation improves IH robustness by +10.0% on average across 16 in-distribution, out-of-distribution, and human red-teaming benchmarks (84.1% to 94.1%), reduces unsafe behavior from 6.6% to 0.7% while improving helpfulness on general safety evaluations, and saturates an internal static agentic prompt injection evaluation, with minimal capability regression. We release the IH-Challenge dataset (https://huggingface.co/datasets/openai/ih-challenge) to support future research on robust instruction hierarchy.
Metadata
arXiv: 2603.10521v1 • Published: 2026-03-11
Categories: cs.AI (primary), cs.CL, cs.CR, cs.LG
Related papers
Gen-Searcher: Reinforcing Agentic Search for Image Generation
Kaituo Feng, Manyuan Zhang, Shuang Chen, Yunlong Lin, Kaixuan Fan, Yilei Jian... • 2026-03-30
On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers
Omer Dahary, Benaya Koren, Daniel Garibi, Daniel Cohen-Or • 2026-03-30
Graphilosophy: Graph-Based Digital Humanities Computing with The Four Books
Minh-Thu Do, Quynh-Chau Le-Tran, Duc-Duy Nguyen-Mai, Thien-Trang Nguyen, Khan... • 2026-03-30
ParaSpeechCLAP: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining
Anuj Diwan, Eunsol Choi, David Harwath • 2026-03-30
RAD-AI: Rethinking Architecture Documentation for AI-Augmented Ecosystems
Oliver Aleksander Larsen, Mahyar T. Moghaddam • 2026-03-30