Paper
SafeSeek: Universal Attribution of Safety Circuits in Language Models
Authors
Miao Yu, Siyuan Fu, Moayad Aloqaily, Zhenhong Zhou, Safa Otoum, Xing fan, Kun Wang, Yufei Guo, Qingsong Wen
Abstract
Mechanistic interpretability reveals that safety-critical behaviors (e.g., alignment, jailbreak, backdoor) in Large Language Models (LLMs) are grounded in specialized functional components. However, existing safety attribution methods struggle with generalization and reliability because they rely on heuristic, domain-specific metrics and search algorithms. To address this, we propose SafeSeek, a unified safety interpretability framework that identifies functionally complete safety circuits in LLMs via optimization. Unlike methods that focus on isolated heads or neurons, SafeSeek introduces differentiable binary masks to extract multi-granular circuits through gradient descent on safety datasets, and integrates Safety Circuit Tuning, which uses these sparse circuits for efficient safety fine-tuning. We validate SafeSeek in two key LLM safety scenarios: (1) backdoor attacks, identifying a backdoor circuit at 0.42% sparsity whose ablation drives the Attack Success Rate (ASR) from 100% to 0.4% while retaining over 99% general utility; (2) safety alignment, localizing an alignment circuit spanning 3.03% of heads and 0.79% of neurons whose removal spikes ASR from 0.8% to 96.9%, whereas excluding this circuit during helpfulness fine-tuning maintains 96.5% safety retention.
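The abstract does not spell out how the "differentiable binary masks" are parameterized. A common way to realize binary masks trained by gradient descent is a sigmoid-relaxed mask with a straight-through estimator plus a sparsity penalty; the PyTorch sketch below illustrates that idea for attention heads. The class name, initialization, and penalty are illustrative assumptions, not SafeSeek's actual implementation.

```python
import torch
import torch.nn as nn

class HeadMask(nn.Module):
    """Sketch of a differentiable binary mask over attention heads.

    One learnable logit per head; the forward pass applies a hard 0/1 mask
    while gradients flow through the sigmoid relaxation (straight-through).
    This is an assumed parameterization, not SafeSeek's published one.
    """
    def __init__(self, n_heads: int, init_logit: float = 2.0):
        super().__init__()
        # sigmoid(2.0) ~ 0.88, so all heads start mostly "on".
        self.logits = nn.Parameter(torch.full((n_heads,), init_logit))

    def forward(self, head_outputs: torch.Tensor) -> torch.Tensor:
        # head_outputs: (batch, n_heads, seq_len, d_head)
        probs = torch.sigmoid(self.logits)
        # Straight-through estimator: hard 0/1 values in the forward pass,
        # sigmoid gradients in the backward pass.
        hard = (probs > 0.5).float()
        mask = hard + probs - probs.detach()
        return head_outputs * mask.view(1, -1, 1, 1)

    def sparsity_penalty(self) -> torch.Tensor:
        # Relaxed L0: expected fraction of heads kept "on".
        return torch.sigmoid(self.logits).mean()
```

Under this reading, one would minimize a behavior-preserving loss on a safety dataset plus lambda times `sparsity_penalty()`, and the heads whose masks survive at 1 constitute the extracted circuit; the abstract's per-granularity sparsity figures (e.g., 3.03% of heads, 0.79% of neurons) suggest analogous masks at the neuron level.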
Metadata
arXiv: 2603.23268v1 · Categories: cs.LG, cs.AI
Published: 2026-03-24
Links: https://arxiv.org/abs/2603.23268v1 · PDF: https://arxiv.org/pdf/2603.23268v1
Related papers
Vibe Coding XR: Accelerating AI + XR Prototyping with XR Blocks and Gemini
Ruofei Du, Benjamin Hersh, David Li, Nels Numan, Xun Qian, Yanhe Chen, Zhongy... • 2026-03-25
Comparing Developer and LLM Biases in Code Evaluation
Aditya Mittal, Ryan Shar, Zichu Wu, Shyam Agarwal, Tongshuang Wu, Chris Donah... • 2026-03-25
The Stochastic Gap: A Markovian Framework for Pre-Deployment Reliability and Oversight-Cost Auditing in Agentic Artificial Intelligence
Biplab Pal, Santanu Bhattacharya • 2026-03-25
Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA
Saahil Mathur, Ryan David Rittner, Vedant Ajit Thakur, Daniel Stuart Schiff, ... • 2026-03-25
MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination
Zhuo Li, Yupeng Zhang, Pengyu Cheng, Jiajun Song, Mengyu Zhou, Hao Li, Shujie... • 2026-03-25