An EM based training algorithm for Cross-Language Text Categorization

RIGUTINI, LEONARDO; MAGGINI, MARCO
2005-01-01

Abstract

With the globalization of the Web, many companies and institutions need to efficiently organize and search repositories containing multilingual documents. Managing these heterogeneous text collections significantly increases costs, because experts in different languages are required to organize them. Cross-language text categorization can provide techniques to extend existing automatic classification systems in one language to new languages without requiring additional intervention by human experts. In this paper, we propose a learning algorithm based on the EM scheme which can be used to train text classifiers in a multilingual environment. In particular, the proposed approach assumes that a predefined category set and a collection of labeled training data are available for a given language L1. A classifier for a different language L2 is trained by translating the available labeled training set from L1 into L2 and by using an additional set of unlabeled documents from L2. The unlabeled documents allow us to extract correct statistical properties of L2 that are not completely available in automatically translated examples, both because of the different characteristics of L1 and because the translation process is only approximate. Our experimental results on a test document set extracted from newsgroups in English and Italian show that the performance of the proposed method is very promising.
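The training scheme described in the abstract can be illustrated with a small sketch. The abstract does not specify the underlying classifier or the exact update rules, so the code below assumes a generic semi-supervised multinomial Naive Bayes trained with EM: the model is initialized on the translated labeled documents, then alternately labels the native L2 documents softly (E-step) and re-estimates its parameters from both sets (M-step). All function names, the smoothing parameter `alpha`, and the toy Italian data are hypothetical and chosen only for illustration; this is not necessarily the authors' exact procedure.

```python
import math
from collections import defaultdict


def train_nb(docs, weights, classes, vocab, alpha=1.0):
    """M-step: smoothed multinomial Naive Bayes from (possibly soft) labels.

    docs    : list of token lists
    weights : list of dicts {class: P(class | doc)}
    """
    class_mass = {c: 0.0 for c in classes}
    word_mass = {c: defaultdict(float) for c in classes}
    for tokens, w in zip(docs, weights):
        for c in classes:
            p = w.get(c, 0.0)
            class_mass[c] += p
            for t in tokens:
                word_mass[c][t] += p
    total = sum(class_mass.values())
    log_prior = {
        c: math.log((class_mass[c] + alpha) / (total + alpha * len(classes)))
        for c in classes
    }
    log_like = {}
    for c in classes:
        denom = sum(word_mass[c].values()) + alpha * len(vocab)
        log_like[c] = {t: math.log((word_mass[c][t] + alpha) / denom) for t in vocab}
    return log_prior, log_like


def posterior(tokens, model, classes):
    """E-step for one document: P(class | doc) under the current model."""
    log_prior, log_like = model
    scores = {
        c: log_prior[c] + sum(log_like[c][t] for t in tokens if t in log_like[c])
        for c in classes
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}


def em_cross_language(translated_docs, labels, unlabeled_docs, n_iter=10):
    """Train an L2 classifier from translated L1 data plus unlabeled L2 data."""
    classes = sorted(set(labels))
    vocab = {t for d in translated_docs + unlabeled_docs for t in d}
    hard = [{y: 1.0} for y in labels]
    # Initialization: use only the (noisy) translated labeled documents.
    model = train_nb(translated_docs, hard, classes, vocab)
    for _ in range(n_iter):
        soft = [posterior(d, model, classes) for d in unlabeled_docs]  # E-step
        model = train_nb(translated_docs + unlabeled_docs, hard + soft,
                         classes, vocab)                               # M-step
    return model, classes


def classify(tokens, model, classes):
    post = posterior(tokens, model, classes)
    return max(classes, key=lambda c: post[c])


# Toy data (hypothetical): labeled documents as if machine-translated into
# Italian, plus native Italian documents without labels.
translated = [["calcio", "partita", "squadra"], ["squadra", "gol"],
              ["computer", "software", "programma"], ["programma", "tastiera"]]
labels = ["sport", "sport", "computer", "computer"]
unlabeled = [["partita", "tifosi", "gol", "tifosi"],
             ["software", "schermo", "computer", "schermo"],
             ["tifosi", "squadra"], ["schermo", "tastiera"]]

model, classes = em_cross_language(translated, labels, unlabeled)
print(classify(["tifosi"], model, classes))
print(classify(["schermo"], model, classes))
```

In this toy setup the words "tifosi" and "schermo" never occur in the translated training set, so their class statistics can only come from the unlabeled L2 documents. This mirrors the abstract's point that unlabeled native text supplies properties of L2 that translated examples lack.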
Year: 2005
ISBN: 076952415X
Rigutini, L., Maggini, M., B., L. (2005). An EM based training algorithm for Cross-Language Text Categorization. In Proceedings of the IEEE/ACM/WI International Conference on Web Intelligence (WIC 2005) (pp.529-535) [10.1109/WI.2005.29].
Files in this item:
File: WI05b.pdf (not available)
Type: Post-print
License: NOT PUBLIC - Private/restricted access
Size: 139.17 kB
Format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/11365/38701