AI LLM March 09, 2026

The Struggle Between Continuation and Refusal: A Mechanistic Analysis of the Continuation-Triggered Jailbreak in LLMs

Authors

Yonghong Deng, Zhen Yang, Ping Jian, Xinyue Zhang, Zhongbin Guo, Chengzhi Li

Abstract

With the rapid advancement of large language models (LLMs), their safety has become a critical concern. Despite significant efforts in safety alignment, current LLMs remain vulnerable to jailbreak attacks. The root causes of these vulnerabilities are still poorly understood, which calls for rigorous investigation of jailbreak mechanisms by both the academic and industrial communities. In this work, we focus on a continuation-triggered jailbreak phenomenon, in which simply relocating a continuation-triggered instruction suffix can substantially increase jailbreak success rates. To uncover the mechanisms behind this phenomenon, we conduct a comprehensive mechanistic interpretability analysis at the level of attention heads. Through causal interventions and activation scaling, we show that this jailbreak behavior arises primarily from a competition between the model's intrinsic continuation drive and the safety defenses acquired through alignment training. Furthermore, we perform a detailed behavioral analysis of the identified safety-critical attention heads, revealing notable differences in the functions and behaviors of safety heads across model architectures. These findings provide a novel mechanistic perspective for understanding and interpreting jailbreak behaviors in LLMs, offering both theoretical insights and practical implications for improving model safety.
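The abstract names activation scaling of attention heads as one of its interventions but gives no implementation details. As a rough illustration only, the following is a minimal NumPy sketch of what scaling or ablating a single head's output can look like in a toy causal multi-head attention layer. It is not the authors' code: the function `multi_head_attention`, the `head_scales` parameter, the random weights, and all dimensions are assumptions made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; masked (-inf) entries become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(x, w_q, w_k, w_v, w_o, head_scales):
    """Toy causal multi-head self-attention with a per-head output scale.

    x:           (seq_len, d_model) input activations
    w_q/w_k/w_v: (n_heads, d_model, d_head) per-head projections
    w_o:         (n_heads, d_head, d_model) per-head output projections
    head_scales: (n_heads,) hypothetical multiplier on each head's output;
                 1.0 leaves a head untouched, 0.0 ablates it.
    """
    seq_len = x.shape[0]
    d_head = w_q.shape[-1]
    q = np.einsum("sd,hde->hse", x, w_q)
    k = np.einsum("sd,hde->hse", x, w_k)
    v = np.einsum("sd,hde->hse", x, w_v)
    scores = np.einsum("hse,hte->hst", q, k) / np.sqrt(d_head)
    # Causal mask: position s may not attend to later positions t > s.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    attn = softmax(scores, axis=-1)
    per_head = np.einsum("hst,hte->hse", attn, v)
    # The intervention: rescale each head's output before the output projection.
    per_head = per_head * head_scales[:, None, None]
    return np.einsum("hse,hed->sd", per_head, w_o)


# Random toy weights (assumed values, not taken from any real model).
rng = np.random.default_rng(0)
seq_len, d_model, n_heads, d_head = 5, 8, 4, 2
x = rng.normal(size=(seq_len, d_model))
w_q = rng.normal(size=(n_heads, d_model, d_head))
w_k = rng.normal(size=(n_heads, d_model, d_head))
w_v = rng.normal(size=(n_heads, d_model, d_head))
w_o = rng.normal(size=(n_heads, d_head, d_model))

baseline = multi_head_attention(x, w_q, w_k, w_v, w_o, np.ones(n_heads))

# Ablate head 2 by setting its scale to zero.
scales = np.ones(n_heads)
scales[2] = 0.0
ablated = multi_head_attention(x, w_q, w_k, w_v, w_o, scales)

# Run head 2 alone to isolate its contribution to the layer output.
only_head_2 = np.zeros(n_heads)
only_head_2[2] = 1.0
head_2_only = multi_head_attention(x, w_q, w_k, w_v, w_o, only_head_2)

print("max change from ablating head 2:", np.abs(baseline - ablated).max())
```

Because the layer output is a sum over heads, the change caused by ablating one head equals that head's own contribution. In a real model the scaled output would pass through later layers, so an intervention study would measure the effect on the final behavior (for example, refusal versus continuation) rather than on a single layer's output.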

Metadata

arXiv ID: 2603.08234

Provider: ARXIV

Primary Category: cs.AI

Published: 2026-03-09
