
Overton Pluralistic Reinforcement Learning for Large Language Models

Authors

Yu Fu, Seongho Son, Ilija Bogunovic

Abstract

Existing alignment paradigms remain limited in capturing the pluralistic nature of human values. Overton Pluralism addresses this gap by generating responses with diverse perspectives from a single query. This paper introduces OP-GRPO (Overton Pluralistic Group Relative Policy Optimization), a reinforcement learning framework for implicit Overton Pluralism that enables a single large language model to produce pluralistic responses without explicit prompting or modular orchestration. Our workflow consists of two main steps. First, similarity estimator training fine-tunes a Sentence Transformer for Overton Pluralism tasks to provide more accurate coverage evaluation of generated responses. Second, OP-GRPO training incorporates this similarity estimator into a dual-reward system designed to ensure both broad coverage of genuine human perspectives and the uniqueness of each perspective, thereby promoting diversity. Empirical results demonstrate a "small models, big perspective coverage" effect. The trained Qwen2.5-3B-Instruct model surpasses a 20B GPT-OSS baseline with a 37.4 percent relative accuracy gain on a Natural Language Inference benchmark, and also outperforms a modular architecture baseline with a 19.1 percent relative improvement. Additional evaluations using GPT-4.1 as a large language model judge further confirm the robustness of the approach.
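
Illustrative sketch

To make the dual-reward design concrete, the following Python sketch shows one way a coverage reward and a uniqueness reward could be combined on top of a Sentence Transformer similarity estimator. Everything here is an assumption for illustration: the function names, the match threshold TAU, the mixing weight alpha, and the off-the-shelf all-MiniLM-L6-v2 checkpoint (standing in for the paper's fine-tuned estimator) are not taken from the paper.

# Hedged sketch of a dual reward for Overton-pluralistic responses.
# NOT the paper's implementation; all names and constants are assumed.
from sentence_transformers import SentenceTransformer, util

# Stand-in for the fine-tuned similarity estimator from step 1 of the workflow.
estimator = SentenceTransformer("all-MiniLM-L6-v2")

def coverage_reward(perspectives, references):
    """Fraction of reference human perspectives matched by at least one
    generated perspective (max cosine similarity above a threshold)."""
    TAU = 0.6  # assumed match threshold
    sims = util.cos_sim(
        estimator.encode(references, convert_to_tensor=True),
        estimator.encode(perspectives, convert_to_tensor=True),
    )
    # Best-matching generated perspective for each reference perspective.
    return (sims.max(dim=1).values > TAU).float().mean().item()

def uniqueness_reward(perspectives):
    """Penalize redundancy: one minus the mean pairwise cosine similarity
    among the generated perspectives."""
    if len(perspectives) < 2:
        return 0.0
    emb = estimator.encode(perspectives, convert_to_tensor=True)
    sims = util.cos_sim(emb, emb)
    n = len(perspectives)
    off_diag = (sims.sum() - sims.diagonal().sum()) / (n * (n - 1))
    return 1.0 - off_diag.item()

def dual_reward(perspectives, references, alpha=0.5):
    # Weighted combination; the paper's actual aggregation may differ.
    return alpha * coverage_reward(perspectives, references) \
        + (1 - alpha) * uniqueness_reward(perspectives)

In an OP-GRPO training loop, a scalar reward like this would be computed for each sampled response and converted into group-relative advantages, as in standard GRPO; the paper's exact reward shaping is likely more involved.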

Metadata

arXiv ID: 2602.20759
Provider: ARXIV
Primary Category: cs.CL
Published: 2026-02-24
Comments: 28 pages, 8 figures
Links: https://arxiv.org/abs/2602.20759v1 (abstract); https://arxiv.org/pdf/2602.20759v1 (PDF)
Fetched: 2026-02-25 06:05
