Paper
Phase-Aware Mixture of Experts for Agentic Reinforcement Learning
Authors
Shengtian Yang, Yu Li, Shuo He, Yewen Li, Qingpeng Cai, Peng Jiang, Lei Feng
Abstract
Reinforcement learning (RL) has equipped LLM agents with a strong ability to solve complex tasks. However, existing RL methods normally use a single policy network, causing simplicity bias: simple tasks occupy most parameters and dominate gradient updates, leaving insufficient capacity for complex tasks. A plausible remedy is to employ a Mixture-of-Experts (MoE) architecture in the policy network, since MoE lets different parameters (experts) specialize in different tasks, preventing simple tasks from dominating all parameters. However, a key limitation of traditional MoE is its token-level routing: the router assigns each token to specialized experts independently, which fragments phase-consistent patterns into scattered expert assignments and thus undermines expert specialization. In this paper, we propose Phase-Aware Mixture of Experts (PA-MoE). It features a lightweight phase router that learns latent phase boundaries directly from the RL objective, without pre-defined phase categories, and then routes tokens within the same phase to the same expert, allowing experts to preserve phase-specific expertise. Experimental results demonstrate the effectiveness of the proposed PA-MoE.
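The contrast the abstract draws between token-level and phase-aware routing can be illustrated with a minimal sketch. This is not the paper's method; it is a toy comparison under stated assumptions: `phase_route`, the boundary-detection heuristic (a jump in consecutive hidden states), the mean-pooled segment features, and the `threshold` parameter are all hypothetical stand-ins for the learned phase router described in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

n_tokens, d_model, n_experts = 12, 8, 3
hidden = rng.normal(size=(n_tokens, d_model))    # token hidden states
w_router = rng.normal(size=(d_model, n_experts)) # router projection

# Token-level routing: each token independently picks its top expert,
# so assignments can flicker between experts inside one phase.
token_assign = (hidden @ w_router).argmax(axis=1)

def phase_route(h, w, threshold=4.5):
    """Phase-aware routing sketch: cut the sequence where consecutive
    hidden states change sharply, then route every token of a segment
    to the expert chosen from the segment's mean-pooled features."""
    diffs = np.linalg.norm(np.diff(h, axis=0), axis=1)
    bounds = [0] + [i + 1 for i, d in enumerate(diffs) if d > threshold] + [len(h)]
    assign = np.empty(len(h), dtype=int)
    for s, e in zip(bounds[:-1], bounds[1:]):
        assign[s:e] = int((h[s:e].mean(axis=0) @ w).argmax())
    return assign

phase_assign = phase_route(hidden, w_router)
# phase_assign is piecewise constant over segments: temporally consistent
# assignments, in contrast to the per-token flicker of token_assign.
```

The design point mirrored here is the one the abstract makes: per-token argmax routing can scatter a coherent phase across experts, while segment-level pooling forces every token in a detected phase to share one expert. In PA-MoE the boundaries are learned from the RL objective rather than set by a hand-tuned threshold as in this sketch.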
Metadata
arXiv: 2602.17038v1 • cs.AI • published 2026-02-19 • 16 pages
Affiliations: Southeast University; Kuaishou Technology; Nanyang Technological University
Related papers
Vibe Coding XR: Accelerating AI + XR Prototyping with XR Blocks and Gemini
Ruofei Du, Benjamin Hersh, David Li, Nels Numan, Xun Qian, Yanhe Chen, Zhongy... • 2026-03-25
Comparing Developer and LLM Biases in Code Evaluation
Aditya Mittal, Ryan Shar, Zichu Wu, Shyam Agarwal, Tongshuang Wu, Chris Donah... • 2026-03-25
The Stochastic Gap: A Markovian Framework for Pre-Deployment Reliability and Oversight-Cost Auditing in Agentic Artificial Intelligence
Biplab Pal, Santanu Bhattacharya • 2026-03-25
Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA
Saahil Mathur, Ryan David Rittner, Vedant Ajit Thakur, Daniel Stuart Schiff, ... • 2026-03-25
MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination
Zhuo Li, Yupeng Zhang, Pengyu Cheng, Jiajun Song, Mengyu Zhou, Hao Li, Shujie... • 2026-03-25