AI LLM February 20, 2026

The Statistical Signature of LLMs

Authors

Ortal Hadad, Edoardo Loru, Jacopo Nudo, Niccolò Di Marco, Matteo Cinelli, Walter Quattrociocchi

Abstract

Large language models generate text through probabilistic sampling from high-dimensional distributions, yet how this process reshapes the structural statistical organization of language remains incompletely characterized. Here we show that lossless compression provides a simple, model-agnostic measure of statistical regularity that differentiates generative regimes directly from surface text. We analyze compression behavior across three progressively more complex information ecosystems: controlled human-LLM continuations, generative mediation of a knowledge infrastructure (Wikipedia vs. Grokipedia), and fully synthetic social interaction environments (Moltbook vs. Reddit). Across settings, compression reveals a persistent structural signature of probabilistic generation. In controlled and mediated contexts, LLM-produced language exhibits higher structural regularity and compressibility than human-written text, consistent with a concentration of output within highly recurrent statistical patterns. However, this signature shows scale dependence: in fragmented interaction environments the separation attenuates, suggesting a fundamental limit to surface-level distinguishability at small scales. This compressibility-based separation emerges consistently across models, tasks, and domains and can be observed directly from surface text without relying on model internals or semantic evaluation. Overall, our findings introduce a simple and robust framework for quantifying how generative systems reshape textual production, offering a structural perspective on the evolving complexity of communication.
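As a rough illustration of the idea in the abstract, the sketch below scores a text by its lossless compression ratio. The abstract does not name a compressor or a normalization, so the use of `zlib`, the ratio definition (compressed bytes divided by raw bytes), and the helper name `compression_ratio` are all assumptions made for this example, and the inputs are toy strings rather than data from the paper.

```python
import random
import string
import zlib


def compression_ratio(text: str, level: int = 9) -> float:
    """Compressed size over raw size in bytes; lower means more regular text.

    Hypothetical helper: the compressor (zlib) and this ratio definition
    are assumptions, not the paper's stated method.
    """
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("text must be non-empty")
    return len(zlib.compress(raw, level)) / len(raw)


# Toy input 1: a highly recurrent pattern (1800 characters).
repetitive = "the model generates the next token. " * 50

# Toy input 2: pseudo-random letters of the same length, seeded for repeatability.
rng = random.Random(0)
alphabet = string.ascii_lowercase + " "
irregular = "".join(rng.choices(alphabet, k=len(repetitive)))

# Toy input 3: a very short fragment, where fixed compressor overhead dominates.
fragment = "ok"

ratio_repetitive = compression_ratio(repetitive)
ratio_irregular = compression_ratio(irregular)
ratio_fragment = compression_ratio(fragment)

print(f"repetitive: {ratio_repetitive:.3f}")
print(f"irregular:  {ratio_irregular:.3f}")
print(f"fragment:   {ratio_fragment:.3f}")
```

The repetitive string compresses far better than the irregular one, which is the kind of separation the abstract attributes to text concentrated in recurrent patterns. The two-character fragment comes out larger after compression than before, because header and checksum overhead outweigh any savings. That is one plausible contributor to weaker separation on very short texts, though the abstract itself does not attribute the attenuation to this mechanism.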

Metadata

arXiv ID: 2602.18152
Provider: ARXIV
Primary Category: cs.CL
Cross-listed Categories: cs.CY, physics.soc-ph
Published: 2026-02-20
Fetched: 2026-02-23 05:33
