An Extreme Multi-label Text Classification (XMTC) Library Dataset: What if we took "Use of Practical AI in Digital Libraries" seriously?

Authors

Jennifer D'Souza, Sameer Sadruddin, Maximilian Kähler, Andrea Salfinger, Luca Zaccagna, Francesca Incitti, Lauro Snidaro, Osma Suominen

Abstract

Subject indexing is vital for discovery but hard to sustain at scale and across languages. We release a large bilingual (English/German) corpus of catalog records annotated with the Integrated Authority File (GND), plus a machine-actionable GND taxonomy. The resource enables ontology-aware multi-label classification, mapping text to authority terms, and agent-assisted cataloging with reproducible, authority-grounded evaluation. We provide a brief statistical profile and qualitative error analyses of three systems. We invite the community to assess not only accuracy but usefulness and transparency, toward authority-anchored AI co-pilots that amplify catalogers' work.

Metadata

arXiv ID: 2603.10876
Provider: ARXIV
Primary Category: cs.CL
Published: 2026-03-11
Fetched: 2026-03-12 04:21
Categories: cs.CL, cs.AI, cs.DL, cs.IR
Abstract page: https://arxiv.org/abs/2603.10876v1
PDF: https://arxiv.org/pdf/2603.10876v1
Comment: 9 pages, 5 figures. Accepted to appear in the Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)