February 19, 2026

DRetHTR: Linear-Time Decoder-Only Retentive Network for Handwritten Text Recognition

Authors

Changhun Kim, Martin Mayr, Thomas Gorges, Fei Wu, Mathias Seuret, Andreas Maier, Vincent Christlein

Abstract

State-of-the-art handwritten text recognition (HTR) systems commonly use Transformers, whose growing key-value (KV) cache makes decoding slow and memory-intensive. We introduce DRetHTR, a decoder-only model built on Retentive Networks (RetNet). Compared to an equally sized decoder-only Transformer baseline, DRetHTR delivers 1.6-1.9x faster inference with 38-42% less memory usage, without loss of accuracy. By replacing softmax attention with softmax-free retention and injecting multi-scale sequential priors, DRetHTR avoids a growing KV cache: decoding is linear in output length in both time and memory. To recover the local-to-global inductive bias of attention, we propose layer-wise gamma scaling, which progressively enlarges the effective retention horizon in deeper layers. This encourages early layers to model short-range dependencies and later layers to capture broader context, mitigating the flexibility gap introduced by removing softmax. Consequently, DRetHTR achieves best reported test character error rates of 2.26% (IAM-A, en), 1.81% (RIMES, fr), and 3.46% (Bentham, en), and is competitive on READ-2016 (de) with 4.21%. This demonstrates that decoder-only RetNet enables Transformer-level HTR accuracy with substantially improved decoding speed and memory efficiency.
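The constant-memory decoding claim follows from retention's recurrent form: instead of attending over a growing KV cache, each step folds the current key-value outer product into a fixed-size state that is decayed by a factor gamma. The sketch below is a minimal NumPy illustration of single-head retention (per the RetNet formulation, not this paper's full model; the function names and shapes are illustrative) showing that the O(1)-per-step recurrent form matches the parallel training-time form:

```python
import numpy as np

def recurrent_retention(q, k, v, gamma):
    """Recurrent (decoding-time) retention: the fixed-size state S replaces
    the growing KV cache, so per-step time and memory are constant."""
    T, d = q.shape
    S = np.zeros((d, v.shape[1]))          # state size independent of T
    outputs = np.empty((T, v.shape[1]))
    for t in range(T):
        # decay the accumulated state, then add the current key-value update
        S = gamma * S + np.outer(k[t], v[t])
        outputs[t] = q[t] @ S
    return outputs

def parallel_retention(q, k, v, gamma):
    """Equivalent parallel (training-time) form: a causal decay mask
    D[i, j] = gamma**(i - j) for j <= i replaces softmax attention."""
    T = q.shape[0]
    idx = np.arange(T)
    D = np.where(idx[:, None] >= idx[None, :],
                 gamma ** (idx[:, None] - idx[None, :]), 0.0)
    return ((q @ k.T) * D) @ v
```

The proposed layer-wise gamma scaling would correspond to choosing a gamma closer to 1 in deeper layers (a slower decay, hence a longer effective retention horizon); the exact schedule is defined in the paper, not here.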

Metadata

arXiv ID: 2602.17387
Provider: ARXIV
Primary Category: cs.CV
Published: 2026-02-19
Fetched: 2026-02-21 18:51

Comment: Submitted to Pattern Recognition, 11 pages + 2-page appendix, 7 figures, 12 tables