Learning from Synthetic Data Improves Multi-hop Reasoning

Authors

Anmol Kabra, Yilun Yin, Albert Gong, Kamilė Stankevičiūtė, Dongyoung Go, Johann Lee, Katie Z. Luo, Carla P. Gomes, Kilian Q. Weinberger

Abstract

Reinforcement Learning (RL) has been shown to significantly boost the reasoning capabilities of large language models (LLMs) in math, coding, and multi-hop reasoning tasks. However, RL fine-tuning requires abundant high-quality verifiable data, typically sourced from human annotations, generated by frontier LLMs, or scored by LLM-based verifiers. All three have considerable limitations: human-annotated datasets are small and expensive to curate, LLM-generated data is hallucination-prone and costly, and LLM-based verifiers are inaccurate and slow. In this work, we investigate a cheaper alternative: RL fine-tuning on rule-generated synthetic data for multi-hop reasoning tasks. We discover that LLMs fine-tuned on synthetic data perform significantly better on popular real-world question-answering benchmarks, even though the synthetic data contains only fictional knowledge. When we stratify performance by question difficulty, we find that synthetic data teaches LLMs to compose knowledge -- a fundamental and generalizable reasoning skill. Our work highlights rule-generated synthetic reasoning data as a free and scalable resource for improving LLM reasoning capabilities.
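
The paper's actual data generator is not shown on this page, so the following is a minimal, hypothetical Python sketch of what rule-generated multi-hop data with fictional knowledge could look like: entity names are random strings (so no real-world facts are involved), fictional facts are chained into an n-hop question, and the chain's final entity is the exact-match answer, making the reward checkable by a simple rule rather than an LLM-based verifier. All function and field names here are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of rule-generated synthetic multi-hop QA data.
# Entities are random strings, so the "knowledge" is entirely fictional;
# the answer is the last entity in the chain, verifiable by string match.
import random
import string


def fictional_name(rng: random.Random, length: int = 8) -> str:
    """Random lowercase token, capitalized, so no real-world entity leaks in."""
    return "".join(rng.choices(string.ascii_lowercase, k=length)).capitalize()


def make_multihop_example(rng: random.Random, hops: int = 3) -> dict:
    """Build one n-hop example: chained fictional facts, question, gold answer."""
    relations = ["advisor of", "founder of", "successor of", "creator of"]
    entities = [fictional_name(rng) for _ in range(hops + 1)]

    # Chain fictional facts: entities[i] --relation--> entities[i + 1].
    facts = [
        f"{entities[i]} is the {rng.choice(relations)} {entities[i + 1]}."
        for i in range(hops)
    ]

    # Answering requires composing all `hops` facts; the reward is a rule
    # (exact match against entities[-1]), so no LLM verifier is needed.
    question = (
        "Using the facts above, which entity do you reach by following "
        f"the chain starting at {entities[0]} for {hops} steps?"
    )
    return {
        "context": " ".join(facts),
        "question": question,
        "answer": entities[-1],  # rule-verifiable reward target for RL
    }


if __name__ == "__main__":
    rng = random.Random(0)
    example = make_multihop_example(rng, hops=3)
    print(example["context"])
    print(example["question"])
    print("Gold answer:", example["answer"])
```

Because difficulty is controlled by a single parameter (the hop count), a generator of this shape could stratify training and evaluation data by the number of facts a model must compose, matching the difficulty analysis the abstract describes.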

Metadata

arXiv ID: 2603.02091
Provider: ARXIV
Primary Category: cs.LG
Categories: cs.LG, cs.AI, cs.CL
Comment: Accepted to ICLR 2026
Published: 2026-03-02
Fetched: 2026-03-03 04:34
Links: https://arxiv.org/abs/2603.02091v1 (abstract), https://arxiv.org/pdf/2603.02091v1 (PDF)
