Beyond Refusal: Probing the Limits of Agentic Self-Correction for Semantic Sensitive Information

Authors

Umid Suleymanov, Zaur Rajabov, Emil Mirzazada, Murat Kantarcioglu

Abstract

While defenses for structured PII are mature, Large Language Models (LLMs) pose a new threat: Semantic Sensitive Information (SemSI), where models infer sensitive identity attributes, generate reputation-harmful content, or hallucinate potentially wrong information. The capacity of LLMs to self-regulate these complex, context-dependent leaks of sensitive information without destroying utility remains an open scientific question. To address this, we introduce SemSIEdit, an inference-time framework where an agentic "Editor" iteratively critiques and rewrites sensitive spans to preserve narrative flow rather than simply refusing to answer. Our analysis reveals a Privacy-Utility Pareto Frontier, where this agentic rewriting reduces leakage by 34.6% across all three SemSI categories while incurring a marginal utility loss of 9.8%. We also uncover a Scale-Dependent Safety Divergence: large reasoning models (e.g., GPT-5) achieve safety through constructive expansion (adding nuance), whereas capacity-constrained models revert to destructive truncation (deleting text). Finally, we identify a Reasoning Paradox: while inference-time reasoning increases baseline risk by enabling the model to make deeper sensitive inferences, it simultaneously empowers the defense to execute safe rewrites.

Metadata

arXiv ID: 2602.21496
Provider: ARXIV
Primary Category: cs.AI
Published: 2026-02-25
Comment: Under Review
Links: https://arxiv.org/abs/2602.21496v1 (abstract), https://arxiv.org/pdf/2602.21496v1 (PDF)
Fetched: 2026-02-26 05:00
