Paper
Acoustic-to-articulatory Inversion of the Complete Vocal Tract from RT-MRI with Various Audio Embeddings and Dataset Sizes
Authors
Sofiane Azzouz, Pierre-André Vuissoz, Yves Laprie
Abstract
Articulatory-to-acoustic inversion strongly depends on the type of data used. While most previous studies rely on EMA, which is limited by the number of sensors and restricted to accessible articulators, we propose an approach aiming at a complete inversion of the vocal tract, from the glottis to the lips. To this end, we used approximately 3.5 hours of RT-MRI data from a single speaker. The innovation of our approach lies in the use of articulator contours automatically extracted from MRI images, rather than relying on the raw images themselves. By focusing on these contours, the model prioritizes the essential geometric dynamics of the vocal tract while discarding redundant pixel-level information. These contours, alongside denoised audio, were then processed using a Bi-LSTM architecture. Two experiments were conducted: (1) the analysis of the impact of the audio embedding, for which three types of embeddings were evaluated as input to the model (MFCCs, LCCs, and HuBERT), and (2) the study of the influence of the dataset size, which we varied from 10 minutes to 3.5 hours. Evaluation was performed on the test data using RMSE, median error, as well as Tract Variables, to which we added an additional measurement: the larynx height. The average RMSE obtained is 1.48\,mm, compared with the pixel size (1.62\,mm). These results confirm the feasibility of a complete vocal-tract inversion using RT-MRI data.
Metadata
Related papers
Cosmic Shear in Effective Field Theory at Two-Loop Order: Revisiting $S_8$ in Dark Energy Survey Data
Shi-Fan Chen, Joseph DeRose, Mikhail M. Ivanov, Oliver H. E. Philcox • 2026-03-30
Stop Probing, Start Coding: Why Linear Probes and Sparse Autoencoders Fail at Compositional Generalisation
Vitória Barin Pacela, Shruti Joshi, Isabela Camacho, Simon Lacoste-Julien, Da... • 2026-03-30
SNID-SAGE: A Modern Framework for Interactive Supernova Classification and Spectral Analysis
Fiorenzo Stoppa, Stephen J. Smartt • 2026-03-30
Rotating black hole shadows in metric-affine bumblebee gravity
Jose R. Nascimento, Ana R. M. Oliveira, Albert Yu. Petrov, Paulo J. Porfírio,... • 2026-03-30
Scalability of the asynchronous discontinuous Galerkin method for compressible flow simulations
Shubham Kumar Goswami, Dapse Vidyesh, Konduri Aditya • 2026-03-30
Raw Data (Debug)
{
"raw_xml": "<entry>\n <id>http://arxiv.org/abs/2603.28723v1</id>\n <title>Acoustic-to-articulatory Inversion of the Complete Vocal Tract from RT-MRI with Various Audio Embeddings and Dataset Sizes</title>\n <updated>2026-03-30T17:38:46Z</updated>\n <link href='https://arxiv.org/abs/2603.28723v1' rel='alternate' type='text/html'/>\n <link href='https://arxiv.org/pdf/2603.28723v1' rel='related' title='pdf' type='application/pdf'/>\n <summary>Articulatory-to-acoustic inversion strongly depends on the type of data used. While most previous studies rely on EMA, which is limited by the number of sensors and restricted to accessible articulators, we propose an approach aiming at a complete inversion of the vocal tract, from the glottis to the lips. To this end, we used approximately 3.5 hours of RT-MRI data from a single speaker. The innovation of our approach lies in the use of articulator contours automatically extracted from MRI images, rather than relying on the raw images themselves. By focusing on these contours, the model prioritizes the essential geometric dynamics of the vocal tract while discarding redundant pixel-level information. These contours, alongside denoised audio, were then processed using a Bi-LSTM architecture. Two experiments were conducted: (1) the analysis of the impact of the audio embedding, for which three types of embeddings were evaluated as input to the model (MFCCs, LCCs, and HuBERT), and (2) the study of the influence of the dataset size, which we varied from 10 minutes to 3.5 hours. Evaluation was performed on the test data using RMSE, median error, as well as Tract Variables, to which we added an additional measurement: the larynx height. The average RMSE obtained is 1.48\\,mm, compared with the pixel size (1.62\\,mm). These results confirm the feasibility of a complete vocal-tract inversion using RT-MRI data.</summary>\n <category scheme='http://arxiv.org/schemas/atom' term='eess.AS'/>\n <published>2026-03-30T17:38:46Z</published>\n <arxiv:primary_category term='eess.AS'/>\n <author>\n <name>Sofiane Azzouz</name>\n </author>\n <author>\n <name>Pierre-André Vuissoz</name>\n </author>\n <author>\n <name>Yves Laprie</name>\n </author>\n </entry>"
}