AI · LLM · February 26, 2026

Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models

Authors

Chungpa Lee, Jy-yong Sohn, Kangwook Lee

Abstract

Transformer-based large language models exhibit in-context learning, enabling adaptation to downstream tasks via few-shot prompting with demonstrations. In practice, such models are often fine-tuned to improve zero-shot performance on downstream tasks, allowing them to solve tasks without examples and thereby reducing inference costs. However, fine-tuning can degrade in-context learning, limiting the performance of fine-tuned models on tasks not seen during fine-tuning. Using linear attention models, we provide a theoretical analysis that characterizes how fine-tuning objectives modify attention parameters and identifies conditions under which this leads to degraded few-shot performance. We show that fine-tuning all attention parameters can harm in-context learning, whereas restricting updates to the value matrix improves zero-shot performance while preserving in-context learning. We further show that incorporating an auxiliary few-shot loss enhances in-context learning primarily on the target task, at the expense of degraded in-context learning ability on tasks not seen during fine-tuning. We empirically validate our theoretical results.
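Illustrative sketch (not from the paper): the restricted fine-tuning regime described in the abstract can be mocked up with a single linear attention layer in PyTorch. The class and function names below (LinearAttention, fine_tune), the synthetic zero-shot task, and all hyperparameters are assumptions made for this example; the sketch only illustrates updating the value matrix W_v while keeping the query and key matrices frozen, the regime the abstract says improves zero-shot performance while preserving in-context learning.

import torch
import torch.nn as nn

class LinearAttention(nn.Module):
    # One softmax-free attention layer: out = (Q K^T / n) V with Q = Z W_q, K = Z W_k, V = Z W_v.
    def __init__(self, dim):
        super().__init__()
        self.W_q = nn.Linear(dim, dim, bias=False)
        self.W_k = nn.Linear(dim, dim, bias=False)
        self.W_v = nn.Linear(dim, dim, bias=False)

    def forward(self, Z):                                   # Z: (batch, n_tokens, dim)
        q, k, v = self.W_q(Z), self.W_k(Z), self.W_v(Z)
        attn = q @ k.transpose(-2, -1) / Z.shape[1]         # linear attention: no softmax
        return attn @ v

def fine_tune(model, prompts, targets, value_only=True, steps=200, lr=1e-2):
    # Zero-shot fine-tuning: each prompt holds only a query token, no demonstrations.
    # With value_only=True, the query/key matrices are frozen and only W_v is updated.
    for name, p in model.named_parameters():
        p.requires_grad_((not value_only) or name.startswith("W_v"))
    opt = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=lr)
    for _ in range(steps):
        pred = model(prompts)[:, -1, -1]                    # prediction read off the query token's last coordinate
        loss = ((pred - targets) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()

if __name__ == "__main__":
    torch.manual_seed(0)
    dim, batch = 8, 64
    model = LinearAttention(dim)
    prompts = torch.randn(batch, 1, dim)                    # zero-shot prompts: one query token each
    targets = prompts[:, 0, :-1].sum(dim=-1)                # toy downstream labelling rule
    print("zero-shot loss after value-only fine-tuning:", fine_tune(model, prompts, targets))

The auxiliary few-shot loss mentioned in the abstract would correspond to adding a second term to the objective, computed on prompts that also contain demonstration pairs; that variant is omitted here for brevity.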

Metadata

arXiv ID: 2602.23197
Provider: ARXIV
Primary Category: cs.CL
Published: 2026-02-26
Fetched: 2026-02-27 04:35
