AI LLM March 03, 2026

TAO-Attack: Toward Advanced Optimization-Based Jailbreak Attacks for Large Language Models

Authors

Zhi Xu, Jiaqi Li, Xiaotong Zhang, Hong Yu, Han Liu

Abstract

Large language models (LLMs) have achieved remarkable success across diverse applications but remain vulnerable to jailbreak attacks, where attackers craft prompts that bypass safety alignment and elicit unsafe responses. Among existing approaches, optimization-based attacks have shown strong effectiveness, yet current methods often suffer from frequent refusals, pseudo-harmful outputs, and inefficient token-level updates. In this work, we propose TAO-Attack, a new optimization-based jailbreak method. TAO-Attack employs a two-stage loss function: the first stage suppresses refusals to ensure the model continues harmful prefixes, while the second stage penalizes pseudo-harmful outputs and encourages the model toward more harmful completions. In addition, we design a direction-priority token optimization (DPTO) strategy that improves efficiency by aligning candidates with the gradient direction before considering update magnitude. Extensive experiments on multiple LLMs demonstrate that TAO-Attack consistently outperforms state-of-the-art methods, achieving higher attack success rates and even reaching 100% in certain scenarios.

Metadata

arXiv ID: 2603.03081

Provider: ARXIV

Primary Category: cs.CL

Published: 2026-03-03

Fetched: 2026-03-04 03:41
