Paper
Small LLMs for Medical NLP: a Systematic Analysis of Few-Shot, Constraint Decoding, Fine-Tuning and Continual Pre-Training in Italian
Authors
Pietro Ferrazzi, Mattia Franzin, Alberto Lavelli, Bernardo Magnini
Abstract
Large Language Models (LLMs) consistently excel in diverse medical Natural Language Processing (NLP) tasks, yet their substantial computational requirements often limit deployment in real-world healthcare settings. In this work, we investigate whether "small" LLMs (around one billion parameters) can effectively perform medical tasks while maintaining competitive accuracy. We evaluate models from three major families-Llama-3, Gemma-3, and Qwen3-across 20 clinical NLP tasks among Named Entity Recognition, Relation Extraction, Case Report Form Filling, Question Answering, and Argument Mining. We systematically compare a range of adaptation strategies, both at inference time (few-shot prompting, constraint decoding) and at training time (supervised fine-tuning, continual pretraining). Fine-tuning emerges as the most effective approach, while the combination of few-shot prompting and constraint decoding offers strong lower-resource alternatives. Our results show that small LLMs can match or even surpass larger baselines, with our best configuration based on Qwen3-1.7B achieving an average score +9.2 points higher than Qwen3-32B. We release a comprehensive collection of all the publicly available Italian medical datasets for NLP tasks, together with our top-performing models. Furthermore, we release an Italian dataset of 126M words from the Emergency Department of an Italian Hospital, and 175M words from various sources that we used for continual pre-training.
Metadata
Related papers
Vibe Coding XR: Accelerating AI + XR Prototyping with XR Blocks and Gemini
Ruofei Du, Benjamin Hersh, David Li, Nels Numan, Xun Qian, Yanhe Chen, Zhongy... • 2026-03-25
Comparing Developer and LLM Biases in Code Evaluation
Aditya Mittal, Ryan Shar, Zichu Wu, Shyam Agarwal, Tongshuang Wu, Chris Donah... • 2026-03-25
The Stochastic Gap: A Markovian Framework for Pre-Deployment Reliability and Oversight-Cost Auditing in Agentic Artificial Intelligence
Biplab Pal, Santanu Bhattacharya • 2026-03-25
Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA
Saahil Mathur, Ryan David Rittner, Vedant Ajit Thakur, Daniel Stuart Schiff, ... • 2026-03-25
MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination
Zhuo Li, Yupeng Zhang, Pengyu Cheng, Jiajun Song, Mengyu Zhou, Hao Li, Shujie... • 2026-03-25
Raw Data (Debug)
{
"raw_xml": "<entry>\n <id>http://arxiv.org/abs/2602.17475v1</id>\n <title>Small LLMs for Medical NLP: a Systematic Analysis of Few-Shot, Constraint Decoding, Fine-Tuning and Continual Pre-Training in Italian</title>\n <updated>2026-02-19T15:38:46Z</updated>\n <link href='https://arxiv.org/abs/2602.17475v1' rel='alternate' type='text/html'/>\n <link href='https://arxiv.org/pdf/2602.17475v1' rel='related' title='pdf' type='application/pdf'/>\n <summary>Large Language Models (LLMs) consistently excel in diverse medical Natural Language Processing (NLP) tasks, yet their substantial computational requirements often limit deployment in real-world healthcare settings. In this work, we investigate whether \"small\" LLMs (around one billion parameters) can effectively perform medical tasks while maintaining competitive accuracy. We evaluate models from three major families-Llama-3, Gemma-3, and Qwen3-across 20 clinical NLP tasks among Named Entity Recognition, Relation Extraction, Case Report Form Filling, Question Answering, and Argument Mining. We systematically compare a range of adaptation strategies, both at inference time (few-shot prompting, constraint decoding) and at training time (supervised fine-tuning, continual pretraining). Fine-tuning emerges as the most effective approach, while the combination of few-shot prompting and constraint decoding offers strong lower-resource alternatives. Our results show that small LLMs can match or even surpass larger baselines, with our best configuration based on Qwen3-1.7B achieving an average score +9.2 points higher than Qwen3-32B. We release a comprehensive collection of all the publicly available Italian medical datasets for NLP tasks, together with our top-performing models. Furthermore, we release an Italian dataset of 126M words from the Emergency Department of an Italian Hospital, and 175M words from various sources that we used for continual pre-training.</summary>\n <category scheme='http://arxiv.org/schemas/atom' term='cs.CL'/>\n <published>2026-02-19T15:38:46Z</published>\n <arxiv:comment>Paper Accepted at LREC 2026</arxiv:comment>\n <arxiv:primary_category term='cs.CL'/>\n <author>\n <name>Pietro Ferrazzi</name>\n </author>\n <author>\n <name>Mattia Franzin</name>\n </author>\n <author>\n <name>Alberto Lavelli</name>\n </author>\n <author>\n <name>Bernardo Magnini</name>\n </author>\n </entry>"
}