Despite significant advances in Natural Language Processing, applying state-of-the-art models to real-world business settings remains challenging. A key obstacle is the mismatch between widely used academic benchmarks and the noisy, imbalanced data often encountered in domains such as finance, law, and medicine, especially in languages other than English, where resources are typically scarce. To address this gap, we introduce PharmaER.IT, a new dataset for entity recognition in the pharmaceutical and medical domain for the Italian language. PharmaER.IT is constructed from drug information leaflets obtained from the Agenzia Italiana del Farmaco and annotated using either semi-automatic or fully automatic methods. The dataset comprises two complementary corpora: (1) the GOLD corpus, consisting of 57 leaflets annotated via a committee-based algorithm followed by expert manual validation, yielding 16,833 high-quality entity mentions; and (2) the SILVER corpus, containing 2,138 leaflets annotated solely through the automatic pipeline, without any human curation. We establish reference performance by evaluating a range of token classification models and several LLMs under zero-shot conditions.
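
The abstract does not say how the committee-based annotation algorithm works. As an illustration only, the sketch below shows one common way such a committee can be combined: several automatic annotators each propose entity spans, and a span is kept when enough of them agree. The function name `committee_vote`, the `min_votes` threshold, the labels `DRUG` and `SYMPTOM`, and the example sentence are all assumptions made for this sketch, not details taken from the paper.

```python
from collections import Counter


def committee_vote(proposals, min_votes):
    """Keep every (start, end, label) span proposed by at least `min_votes` annotators.

    `proposals` holds one list of spans per annotator (hypothetical format).
    """
    votes = Counter()
    for spans in proposals:
        # Deduplicate per annotator so nobody votes twice for the same span.
        votes.update(set(spans))
    return sorted(span for span, count in votes.items() if count >= min_votes)


# Hypothetical leaflet sentence with character-offset spans.
text = "Assumere paracetamolo in caso di febbre."

proposals = [
    [(9, 21, "DRUG"), (33, 39, "SYMPTOM")],                   # annotator 1
    [(9, 21, "DRUG")],                                        # annotator 2
    [(9, 21, "DRUG"), (33, 39, "SYMPTOM"), (0, 8, "DRUG")],   # annotator 3
]

accepted = committee_vote(proposals, min_votes=2)
for start, end, label in accepted:
    print(label, text[start:end])
```

In this sketch, the span proposed by only one annotator is discarded, while the two spans backed by at least two annotators survive. In a GOLD-style workflow, the surviving spans would then go to a human expert for validation; in a SILVER-style workflow, they would be used as-is.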

Zugarini, A., Rigutini, L. (2025). PharmaER.IT: an Italian dataset for entity recognition in the pharmaceutical domain. In Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025).

PharmaER.IT: an Italian dataset for entity recognition in the pharmaceutical domain

Andrea Zugarini; Leonardo Rigutini
2025-01-01

Files in this item:
File: 109_main_long.pdf
Access: open access
Type: publisher's PDF
License: Creative Commons
Size: 255.38 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11365/1301660