PharmaER.IT: an Italian dataset for entity recognition in the pharmaceutical domain


Despite significant advances in Natural Language Processing, applying state-of-the-art models to real-world business applications remains challenging. A key obstacle is the mismatch between widely used academic benchmarks and the noisy, imbalanced data often encountered in domains such as finance, law, and medicine, especially in non-English languages, where resources are typically scarce. To address this gap, we introduce PharmaER.IT, a new dataset for entity recognition in the pharmaceutical and medical domain for the Italian language. PharmaER.IT is constructed from drug information leaflets obtained from the Agenzia Italiana del Farmaco and annotated using either semi-automatic or fully automatic methods. The dataset comprises two complementary corpora: (1) the GOLD corpus, consisting of 57 leaflets annotated via a committee-based algorithm followed by expert manual validation, yielding 16,833 high-quality entity mentions; and (2) the SILVER corpus, containing 2,138 leaflets annotated solely through the automatic pipeline, without any human curation. We establish reference performance by evaluating a range of token classification models and several LLMs under zero-shot conditions.

Zugarini, A., Rigutini, L. (2025). PharmaER.IT: an Italian dataset for entity recognition in the pharmaceutical domain. In Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025).


Andrea Zugarini; Leonardo Rigutini

2025-01-01



	Year
	
				2025
			
	Appears in types:
	
				4.1 Conference proceedings contribution

Files in this item:

File	Size	Format
109_main_long.pdf (open access; type: publisher's PDF; license: Creative Commons)	255.38 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11365/1301660