Does Fine-tuning by Reinforcement Learning Improve Generalization in Binary Speech Deepfake Detection?

Authors

Xin Wang, Ge Wanying, Junichi Yamagishi

Abstract

Building speech deepfake detection models that are generalizable to unseen attacks remains a challenging problem. Although the field has shifted toward a pre-training and fine-tuning paradigm using speech foundation models, most approaches rely solely on supervised fine-tuning (SFT). Inspired by the field of large language models, wherein reinforcement learning (RL) is used for model fine-tuning, we investigate the impact of RL, specifically Group Relative Policy Optimization (GRPO). The results from experiments using multiple detectors and test sets indicate that pure GRPO-based fine-tuning improves performance on out-of-domain test sets while maintaining performance on target-domain test data. This approach outperforms both SFT-only and hybrid setups. Our ablation studies further suggest that the negative reward in GRPO may be a key factor in this improvement.
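The abstract does not spell out the training objective, but a minimal sketch may make the GRPO idea concrete for binary detection. Everything below is an illustrative assumption rather than the paper's recipe: the group size, the +1/-1 reward for correct/incorrect decisions, and the plain REINFORCE-style update (full GRPO additionally uses PPO-style clipping and a KL penalty against a reference model) are all hypothetical choices for this sketch.

import torch
import torch.nn.functional as F

def grpo_loss(logits, labels, group_size=8, eps=1e-6):
    # logits: (B, 2) bonafide/spoof scores from a detector head; labels: (B,).
    log_probs = F.log_softmax(logits, dim=-1)  # (B, 2)
    # Sample a group of G decisions per utterance from the current policy.
    actions = torch.distributions.Categorical(logits=logits).sample((group_size,))  # (G, B)
    # Reward +1 for a correct decision, -1 otherwise; the -1 term plays the
    # role of the "negative reward" the ablation studies point to.
    rewards = (actions == labels).float() * 2.0 - 1.0  # (G, B)
    # Group-relative advantage: normalize rewards within each group.
    adv = (rewards - rewards.mean(dim=0)) / (rewards.std(dim=0) + eps)
    # Log-probability of each sampled decision under the current policy.
    act_logp = log_probs.unsqueeze(0).expand(group_size, -1, -1)
    act_logp = act_logp.gather(2, actions.unsqueeze(-1)).squeeze(-1)  # (G, B)
    # Advantage-weighted policy-gradient objective.
    return -(adv.detach() * act_logp).mean()

# Toy usage, with random tensors standing in for a foundation-model head.
logits = torch.randn(4, 2, requires_grad=True)
labels = torch.randint(0, 2, (4,))
grpo_loss(logits, labels).backward()

Note that this sketch is strictly on-policy: the same logits produce both the sampled decisions and the log-probabilities being reweighted by the group-relative advantage.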

Metadata

arXiv ID: 2603.02914
Provider: ARXIV
Primary Category: eess.AS
Published: 2026-03-03
Fetched: 2026-03-04 03:41
Comment: Submitted to Interspeech 2026
PDF: https://arxiv.org/pdf/2603.02914v1
