March 12, 2026

Thermodynamic Descriptors from Molecular Dynamics as Machine Learning Features for Extrapolable Property Prediction

Authors

Nuria H. Espejo, Pablo Llombart, Andrés González de Castilla, Jorge Ramirez, Jorge R. Espinosa, Adiran Garaizar

Abstract

Machine learning (ML) models that rely on molecular structure excel at predicting properties for well-represented organic compounds; however, their limited ability to extrapolate to chemotypes outside their training domain remains a critical bottleneck in chemical discovery. This challenge is particularly acute in industrial discovery, where navigating uncharted chemical space to generate new intellectual property is a primary objective. Normal boiling points serve as a key benchmark for testing the extrapolative power of ML algorithms. A major limitation is that group-contribution methods are, by design, unable to generate predictions for molecules containing unparameterized fragments. Here, we demonstrate that this limitation can be overcome by replacing structural descriptors with thermodynamic properties computed directly from molecular dynamics simulations. We introduce a physics-augmented framework in which a CatBoost regression model learns directly from ensemble-averaged cohesive energies, heats of vaporization, and densities extracted from atomistic liquid-phase simulations. Benchmark comparisons reveal that, while our physics-augmented model and conventional structure-based models perform comparably well on standard organic compounds, only the former maintains controlled error growth when extrapolating to structurally dissimilar chemical space. Our model successfully predicts boiling points for chemical classes entirely absent from training -- including inorganic compounds, salts, and molecules with elements such as Si, B, and Te -- where structure-based models are fundamentally inapplicable. By encoding the intermolecular forces governing phase behavior, our framework establishes a generalizable strategy for property prediction beyond the structural boundaries of existing methods.
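
The abstract names three simulation-derived features (cohesive energy, heat of vaporization, density) but not how they are computed. The sketch below shows one common textbook way to estimate them from ensemble averages. It assumes an ideal-gas vapor, a negligible liquid molar volume, and equal kinetic energy in both phases. The function name, argument names, units, and sample numbers are all hypothetical and are not taken from the paper, whose protocol may differ.

```python
from dataclasses import dataclass
from statistics import fmean

R_KJ_PER_MOL_K = 8.314462618e-3   # gas constant in kJ/(mol K)
AVOGADRO = 6.02214076e23          # molecules per mole
NM3_TO_CM3 = 1e-21                # 1 nm^3 expressed in cm^3


@dataclass
class Descriptors:
    density_g_cm3: float
    cohesive_energy_kj_mol: float
    heat_of_vaporization_kj_mol: float


def thermodynamic_descriptors(liquid_box_energies, gas_molecule_energies,
                              box_volumes_nm3, n_molecules,
                              molar_mass_g_mol, temperature_k):
    """Estimate descriptors from per-frame simulation output (hypothetical inputs).

    liquid_box_energies: potential energy of the whole liquid box, kJ/mol, per frame
    gas_molecule_energies: potential energy of one isolated molecule, kJ/mol, per frame
    box_volumes_nm3: liquid box volume per frame
    """
    # Ensemble averages; the liquid energy is normalized per molecule.
    e_liquid = fmean(liquid_box_energies) / n_molecules
    e_gas = fmean(gas_molecule_energies)

    # Cohesive energy: energy needed to pull one mole of molecules out of the liquid.
    cohesive = e_gas - e_liquid

    # Heat of vaporization: add the ideal-gas pV term (RT), neglecting liquid volume.
    h_vap = cohesive + R_KJ_PER_MOL_K * temperature_k

    # Density from the mass in the box and the average box volume.
    mass_g = n_molecules * molar_mass_g_mol / AVOGADRO
    density = mass_g / (fmean(box_volumes_nm3) * NM3_TO_CM3)

    return Descriptors(density, cohesive, h_vap)


# Made-up, water-like numbers purely for illustration (not data from the paper).
features = thermodynamic_descriptors(
    liquid_box_energies=[-41600.0, -41400.0, -41500.0],
    gas_molecule_energies=[0.0, 0.0, 0.0],
    box_volumes_nm3=[29.8, 30.0, 29.9],
    n_molecules=1000,
    molar_mass_g_mol=18.015,
    temperature_k=298.15,
)
print(features)
```

In a pipeline like the one the abstract describes, a row of such values per compound would stand in for structural fingerprints as the input to a gradient-boosted regressor such as CatBoost, with the normal boiling point as the target.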

Metadata

arXiv ID: 2603.12017
Provider: ARXIV
Primary Category: physics.chem-ph
Published: 2026-03-12
Fetched: 2026-03-13 06:02
Comment: 16 pages, 6 figures
Abstract page: https://arxiv.org/abs/2603.12017v1
PDF: https://arxiv.org/pdf/2603.12017v1