March 17, 2026

Alignment Makes Language Models Normative, Not Descriptive

Authors

Eilam Shapira, Moshe Tennenholtz, Roi Reichart

Abstract

Post-training alignment optimizes language models to match human preference signals, but this objective is not equivalent to modeling observed human behavior. We compare 120 pairs of base and aligned models on more than 10,000 real human decisions in multi-round strategic games: bargaining, persuasion, negotiation, and repeated matrix games. In these settings, base models outperform their aligned counterparts in predicting human choices by nearly 10:1, robustly across model families, prompt formulations, and game configurations. This pattern reverses, however, in settings where human behavior is more likely to follow normative predictions: aligned models dominate on one-shot textbook games across all 12 types tested, on non-strategic lottery choices, and even within the multi-round games themselves at round one, before interaction history develops. This boundary-condition pattern suggests that alignment induces a normative bias: it improves prediction when human behavior is relatively well captured by normative solutions, but hurts prediction in multi-round strategic settings, where behavior is shaped by descriptive dynamics such as reciprocity, retaliation, and history-dependent adaptation. These results reveal a fundamental trade-off between optimizing models for human use and using them as proxies for human behavior.
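The evaluation the abstract describes treats each model as a predictor of observed human choices and compares base and aligned variants pairwise. The sketch below is a minimal illustration of one plausible scoring scheme: mean log-likelihood of the humans' actual actions, aggregated into per-game win counts. The data structures, scoring rule, and toy numbers are assumptions made for illustration, not the authors' pipeline or metrics.

```python
# Minimal sketch (not the paper's code): scoring a "base" and an "aligned" model
# as predictors of observed human choices, decision by decision.
# Decision, mean_log_likelihood, pairwise_wins, and the toy data are hypothetical.
import math
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Decision:
    game_id: str              # which multi-round game the decision came from
    chosen: str               # the action the human actually took
    probs: Dict[str, float]   # model's predicted distribution over actions

def mean_log_likelihood(decisions: List[Decision]) -> float:
    """Average log-probability the model assigns to the observed human choices."""
    eps = 1e-12  # guard against log(0) when the model rules out the chosen action
    return sum(math.log(d.probs.get(d.chosen, 0.0) + eps) for d in decisions) / len(decisions)

def pairwise_wins(base: List[Decision], aligned: List[Decision]) -> Dict[str, int]:
    """Count, per game, which model explains the observed human choices better."""
    wins = {"base": 0, "aligned": 0}
    for g in {d.game_id for d in base}:
        b = [d for d in base if d.game_id == g]
        a = [d for d in aligned if d.game_id == g]
        if mean_log_likelihood(b) > mean_log_likelihood(a):
            wins["base"] += 1
        else:
            wins["aligned"] += 1
    return wins

if __name__ == "__main__":
    # Toy example: two games, two decisions each, with made-up probabilities.
    base_preds = [
        Decision("g1", "accept", {"accept": 0.7, "reject": 0.3}),
        Decision("g1", "reject", {"accept": 0.4, "reject": 0.6}),
        Decision("g2", "retaliate", {"retaliate": 0.6, "cooperate": 0.4}),
        Decision("g2", "cooperate", {"retaliate": 0.5, "cooperate": 0.5}),
    ]
    aligned_preds = [
        Decision("g1", "accept", {"accept": 0.9, "reject": 0.1}),
        Decision("g1", "reject", {"accept": 0.8, "reject": 0.2}),
        Decision("g2", "retaliate", {"retaliate": 0.2, "cooperate": 0.8}),
        Decision("g2", "cooperate", {"retaliate": 0.1, "cooperate": 0.9}),
    ]
    print(pairwise_wins(base_preds, aligned_preds))  # {'base': 2, 'aligned': 0}
```

In this toy run the aligned predictor concentrates probability on the "normative" action in every round, so the base predictor wins both games once history-dependent behavior (a rejection, a retaliation) appears; this mirrors, in miniature, the win-ratio comparison the abstract reports.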

Metadata

arXiv ID: 2603.17218
Provider: ARXIV
Primary Category: cs.CL
Published: 2026-03-17
Fetched: 2026-03-19 06:01
