Paper
Physics-Aware, Shannon-Optimal Compression via Arithmetic Coding for Distributional Fidelity
Authors
Cristiano Fanelli
Abstract
Assessing whether two datasets are distributionally consistent has become a central theme in modern scientific analysis, particularly as generative artificial intelligence is increasingly used to produce synthetic datasets whose fidelity must be rigorously validated against the original data on which they are trained, a task made more challenging by the continued growth in data volume and problem dimensionality. In this work, we propose the use of arithmetic coding to provide a lossless and invertible compression of datasets under a physics-informed probabilistic representation. Datasets that share the same underlying physical correlations admit comparable optimal descriptions, while discrepancies in those correlations-arising from miscalibration, mismodeling, or bias-manifest as an irreducible excess in code length. This excess codelength defines an operational fidelity metric, quantified directly in bits through differences in achievable compression length relative to a physics-inspired reference distribution. We demonstrate that this metric is global, interpretable, additive across components, and asymptotically optimal in the Shannon sense. Moreover, we show that differences in codelength correspond to differences in expected negative log-likelihood evaluated under the same physics-informed reference model. As a byproduct, we also demonstrate that our compression approach achieves a higher compression ratio than traditional general-purpose algorithms such as gzip. Our results establish lossless, physics-aware compression based on arithmetic coding not as an end in itself, but as a measurement instrument for testing the fidelity between datasets.
Metadata
Related papers
Fractal universe and quantum gravity made simple
Fabio Briscese, Gianluca Calcagni • 2026-03-25
POLY-SIM: Polyglot Speaker Identification with Missing Modality Grand Challenge 2026 Evaluation Plan
Marta Moscati, Muhammad Saad Saeed, Marina Zanoni, Mubashir Noman, Rohan Kuma... • 2026-03-25
LensWalk: Agentic Video Understanding by Planning How You See in Videos
Keliang Li, Yansong Li, Hongze Shen, Mengdi Liu, Hong Chang, Shiguang Shan • 2026-03-25
Orientation Reconstruction of Proteins using Coulomb Explosions
Tomas André, Alfredo Bellisario, Nicusor Timneanu, Carl Caleman • 2026-03-25
The role of spatial context and multitask learning in the detection of organic and conventional farming systems based on Sentinel-2 time series
Jan Hemmerling, Marcel Schwieder, Philippe Rufin, Leon-Friedrich Thomas, Mire... • 2026-03-25
Raw Data (Debug)
{
"raw_xml": "<entry>\n <id>http://arxiv.org/abs/2602.19476v1</id>\n <title>Physics-Aware, Shannon-Optimal Compression via Arithmetic Coding for Distributional Fidelity</title>\n <updated>2026-02-23T03:39:21Z</updated>\n <link href='https://arxiv.org/abs/2602.19476v1' rel='alternate' type='text/html'/>\n <link href='https://arxiv.org/pdf/2602.19476v1' rel='related' title='pdf' type='application/pdf'/>\n <summary>Assessing whether two datasets are distributionally consistent has become a central theme in modern scientific analysis, particularly as generative artificial intelligence is increasingly used to produce synthetic datasets whose fidelity must be rigorously validated against the original data on which they are trained, a task made more challenging by the continued growth in data volume and problem dimensionality. In this work, we propose the use of arithmetic coding to provide a lossless and invertible compression of datasets under a physics-informed probabilistic representation. Datasets that share the same underlying physical correlations admit comparable optimal descriptions, while discrepancies in those correlations-arising from miscalibration, mismodeling, or bias-manifest as an irreducible excess in code length. This excess codelength defines an operational fidelity metric, quantified directly in bits through differences in achievable compression length relative to a physics-inspired reference distribution. We demonstrate that this metric is global, interpretable, additive across components, and asymptotically optimal in the Shannon sense. Moreover, we show that differences in codelength correspond to differences in expected negative log-likelihood evaluated under the same physics-informed reference model. As a byproduct, we also demonstrate that our compression approach achieves a higher compression ratio than traditional general-purpose algorithms such as gzip. Our results establish lossless, physics-aware compression based on arithmetic coding not as an end in itself, but as a measurement instrument for testing the fidelity between datasets.</summary>\n <category scheme='http://arxiv.org/schemas/atom' term='cs.IT'/>\n <category scheme='http://arxiv.org/schemas/atom' term='physics.data-an'/>\n <published>2026-02-23T03:39:21Z</published>\n <arxiv:comment>13 pages, 5 figures</arxiv:comment>\n <arxiv:primary_category term='cs.IT'/>\n <author>\n <name>Cristiano Fanelli</name>\n </author>\n </entry>"
}