AI LLM March 09, 2026

A Dataset for Probing Translationese Preferences in English-to-Swedish Translation

Authors

Jenny Kunz, Anja Jarochenko, Marcel Bollmann

Abstract

Translations often carry traces of the source language, a phenomenon known as translationese. We introduce the first freely available English-to-Swedish dataset contrasting translationese sentences with idiomatic alternatives, designed to probe intrinsic preferences of language models. It includes error tags and descriptions of the problems in the original translations. In experiments evaluating smaller Swedish and multilingual LLMs with our dataset, we find that they often favor the translationese phrasing. Human alternatives are chosen more often when the English source sentence is omitted, indicating that exposure to the source biases models toward literal translations, although even without the source as context, models still often prefer the translationese variant. Our dataset and findings provide a resource and benchmark for developing models that produce more natural, idiomatic output in non-English languages.
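
The comparison described in the abstract can be pictured with a minimal sketch. Everything here is an assumption for illustration: the abstract does not say how preferences are measured, so the sketch assumes a model "prefers" whichever candidate it assigns the higher log-probability. The record layout, the function names, the placeholder strings, and the toy scores are all hypothetical; none of them come from the dataset or the paper.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical record layout: (english_source, translationese, idiomatic_alternative).
Pair = Tuple[str, str, str]
# Hypothetical scorer: log-probability of `target`, optionally conditioned on the
# English source. `context` is None when the source sentence is omitted.
Scorer = Callable[[str, Optional[str]], float]


def idiomatic_preference_rate(pairs: Iterable[Pair], score: Scorer, with_source: bool) -> float:
    """Fraction of pairs where the idiomatic alternative scores strictly higher."""
    wins = 0
    total = 0
    for source, translationese, idiomatic in pairs:
        context = source if with_source else None
        if score(idiomatic, context) > score(translationese, context):
            wins += 1
        total += 1
    return wins / total if total else 0.0


# Placeholder records and made-up scores, standing in for real data and a real model.
PAIRS = [
    ("en_1", "sv_literal_1", "sv_idiomatic_1"),
    ("en_2", "sv_literal_2", "sv_idiomatic_2"),
]
TOY_LOGPROBS = {
    ("sv_literal_1", True): -4.0, ("sv_idiomatic_1", True): -5.0,
    ("sv_literal_1", False): -5.0, ("sv_idiomatic_1", False): -4.5,
    ("sv_literal_2", True): -3.0, ("sv_idiomatic_2", True): -3.5,
    ("sv_literal_2", False): -3.0, ("sv_idiomatic_2", False): -3.2,
}


def toy_score(target: str, context: Optional[str]) -> float:
    """Toy stand-in for a language model: looks up an invented log-probability."""
    return TOY_LOGPROBS[(target, context is not None)]


rate_with_source = idiomatic_preference_rate(PAIRS, toy_score, with_source=True)
rate_without_source = idiomatic_preference_rate(PAIRS, toy_score, with_source=False)
print(rate_with_source, rate_without_source)
```

In practice, `toy_score` would be replaced by a function that queries an actual language model. The toy numbers are chosen only so that the two conditions differ and carry no empirical meaning.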

Metadata

arXiv ID: 2603.08450
Provider: ARXIV
Primary Category: cs.CL
Published: 2026-03-09
Fetched: 2026-03-10 05:43
Updated: 2026-03-09T14:46:35Z
Comment: To appear at LREC 2026
Abstract page: https://arxiv.org/abs/2603.08450v1
PDF: https://arxiv.org/pdf/2603.08450v1