Despite significant advances in Natural Language Processing, applying state-of-the-art models to real-world business settings remains challenging. A key obstacle is the mismatch between widely used academic benchmarks and the noisy, imbalanced data often encountered in domains such as finance, law, and medicine, especially in non-English languages, where resources are typically scarce. To address this gap, we introduce PharmaER.IT, a new dataset for entity recognition in the pharmaceutical and medical domain for the Italian language. PharmaER.IT is constructed from drug information leaflets obtained from the Agenzia Italiana del Farmaco, and annotated using either semi-automatic or fully automatic methods. The dataset comprises two complementary corpora: (1) the GOLD corpus, consisting of 57 leaflets annotated via a committee-based algorithm followed by expert manual validation, yielding 16833 high-quality entity mentions; and (2) the SILVER corpus, containing 2138 leaflets annotated solely through the automatic pipeline, without any human curation. We establish reference performance by evaluating a range of token classification models and several LLMs under zero-shot conditions.
Zugarini, A., Rigutini, L. (2025). PharmaER.IT: an Italian dataset for entity recognition in the pharmaceutical domain. In Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025).
PharmaER.IT: an Italian dataset for entity recognition in the pharmaceutical domain
Andrea Zugarini; Leonardo Rigutini
2025-01-01
| File | Size | Format |
|---|---|---|
| 109_main_long.pdf (open access; type: publisher's PDF; license: Creative Commons) | 255.38 kB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/11365/1301660
