AI, LLM · March 23, 2026

SPA: A Simple but Tough-to-Beat Baseline for Knowledge Injection

Authors

Kexian Tang, Jiani Wang, Shaowen Wang, Kaifeng Lyu

Abstract

While large language models (LLMs) are pretrained on massive amounts of data, their knowledge coverage remains incomplete in specialized, data-scarce domains, motivating extensive efforts to study synthetic data generation for knowledge injection. We propose SPA (Scaling Prompt-engineered Augmentation), a simple but tough-to-beat baseline that uses a small set of carefully designed prompts to generate large-scale synthetic data for knowledge injection. Through systematic comparisons, we find that SPA outperforms several strong baselines. Furthermore, we identify two key limitations of prior approaches: (1) while RL-based methods may improve the token efficiency of LLM-based data augmentation at small scale, they suffer from diversity collapse as data scales, leading to diminishing returns; and (2) while multi-stage prompting may outperform simple augmentation methods, its advantages can disappear after careful prompt tuning. Our results suggest that, for knowledge injection, careful prompt design combined with straightforward large-scale augmentation can be surprisingly effective, and we hope SPA can serve as a strong baseline for future studies in this area. Our code is available at https://github.com/Tangkexian/SPA.
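The core recipe the abstract describes (a small fixed set of hand-designed prompts, applied at scale across a domain corpus) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the prompt templates, the `generate` stub, and the sampling count are all hypothetical, and a real pipeline would call an actual LLM API in place of the stub.

```python
from itertools import product

# Hypothetical prompt templates in the spirit of SPA: a small, fixed set of
# hand-designed augmentation prompts. These are illustrative only and are
# not taken from the paper.
TEMPLATES = [
    "Paraphrase the following passage:\n{doc}",
    "Write three question-answer pairs grounded in this passage:\n{doc}",
    "Summarize this passage, then expand each point with an example:\n{doc}",
]

def generate(prompt: str) -> str:
    """Stand-in for an LLM call. A real pipeline would query a model here;
    this stub just echoes a marker so the sketch runs offline."""
    return f"<synthetic text for: {prompt[:40]}>"

def spa_augment(corpus: list[str],
                templates: list[str] = TEMPLATES,
                samples_per_pair: int = 2) -> list[str]:
    """Apply every template to every document, drawing several samples per
    (template, document) pair; scaling up this simple loop is the whole
    augmentation strategy being illustrated."""
    synthetic = []
    for template, doc in product(templates, corpus):
        prompt = template.format(doc=doc)
        for _ in range(samples_per_pair):
            synthetic.append(generate(prompt))
    return synthetic

corpus = ["The mitochondrion is the powerhouse of the cell."]
data = spa_augment(corpus)
print(len(data))  # 3 templates x 1 document x 2 samples = 6
```

The synthetic texts would then be used for continued pretraining or fine-tuning of the target model; the abstract's finding is that this straightforward loop, with carefully tuned templates, is hard to beat as data volume grows.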

Metadata

arXiv ID: 2603.22213
Provider: ARXIV
Primary Category: cs.LG
Published: 2026-03-23
Fetched: 2026-03-24 06:02
