Mala, C.S., Di Maio, C., Proietti, M., Gezici, G., Giannotti, F., Melacci, S., et al. (2025). Towards Building a Trustworthy RAG-Based Chatbot for the Italian Public Administration. In Frontiers in Artificial Intelligence and Applications (pp. 196-204). IOS Press BV [10.3233/faia250637].

Towards Building a Trustworthy RAG-Based Chatbot for the Italian Public Administration

Di Maio, Christian; Melacci, Stefano; Gori, Marco
2025-01-01

Abstract

Building a trustworthy Retrieval-Augmented Generation (RAG) chatbot for Italy’s public sector presents challenges that go beyond selecting an appropriate Large Language Model. A major issue is the retrieval phase, where Italian text embedders often underperform compared to English and multilingual counterparts, hindering precise identification and contextualization of critical information. Regulatory constraints further complicate matters by disallowing closed-source or cloud-based models, forcing reliance on on-premise or fully open-source solutions that may not fully address the linguistic complexities of Italian documents. In our study, we evaluate three embedding approaches using a publicly available Italian dataset: a monolingual Italian approach, a translation-based method leveraging English-only embedders with backward reference mapping, and a multilingual framework applied to both original and translated texts. Our methodology involves chunking documents into coherent segments, embedding them in a high-dimensional semantic space, and measuring retrieval accuracy via top-k similarity searches. Our results indicate that the translation-based approach significantly improves retrieval performance over Italian-specific models, suggesting that bilingual mapping can effectively address both domain-specific challenges and regulatory constraints in developing RAG pipelines for public administration.
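
The translation-based pipeline summarized in the abstract (chunk, translate, embed, retrieve by top-k similarity, map back to the Italian source) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the bag-of-words `embed` function stands in for a real English embedder, `TOY_TRANSLATIONS` stands in for a machine-translation model, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def chunk(document):
    """Split a document into sentence-level chunks (a simple 'coherent segment')."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real sentence embedder."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_index(italian_chunks, translate):
    """Embed the English translation of each chunk, keeping a back-reference
    (the chunk's position) to the original Italian text."""
    return [(i, embed(translate(c))) for i, c in enumerate(italian_chunks)]


def top_k(query_en, index, k=1):
    """Return the positions of the k chunks most similar to the English query."""
    q = embed(query_en)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [i for i, _ in ranked[:k]]


def hit_rate_at_k(queries, index, k=1):
    """Fraction of (query, gold_chunk_id) pairs whose gold chunk is in the top k."""
    hits = sum(gold in top_k(q, index, k) for q, gold in queries)
    return hits / len(queries)


# Hypothetical lookup table standing in for a machine-translation model.
TOY_TRANSLATIONS = {
    "La carta d'identità si rinnova presso il comune.":
        "The identity card is renewed at the municipality.",
    "La tassa sui rifiuti si paga ogni anno.":
        "The waste tax is paid every year.",
}


def translate(text):
    """Look up the toy translation; unknown text is returned unchanged."""
    return TOY_TRANSLATIONS.get(text, text)


DOC_IT = (
    "La carta d'identità si rinnova presso il comune. "
    "La tassa sui rifiuti si paga ogni anno."
)
chunks_it = chunk(DOC_IT)
index = build_index(chunks_it, translate)

# Retrieval runs in English, but the back-reference returns the Italian chunk.
best = top_k("How do I renew my identity card?", index, k=1)[0]
print(chunks_it[best])
```

The point of the design is that similarity is computed only in the English space, while the stored position lets the system hand the untouched Italian passage to the generator. In a real deployment the toy pieces would be replaced by an on-premise translation model and an open-source embedder, in line with the regulatory constraints the abstract describes.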
Year: 2025
ISBN: 9781643686110
Files in this record:
melacci_HHAI2025.pdf

Access: open access
Type: Publisher's PDF
License: Creative Commons
Size: 392.4 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11365/1315905