A Semi-supervised Document Clustering Algorithm based on EM

Rigutini, Leonardo; Maggini, Marco

doi:10.1109/WI.2005.13

Document clustering is a very hard task in automatic text processing, since it requires extracting regular patterns from a document collection without a priori knowledge of the category structure. This task can also be difficult for humans, because many different but valid partitions may exist for the same collection. Moreover, the lack of information about categories makes it difficult to apply effective feature selection techniques to reduce the noise in the representation of texts. Despite these intrinsic difficulties, text clustering is an important task for Web search applications in which huge collections or quite long query result lists must be automatically organized. Semi-supervised clustering lies in between automatic categorization and auto-organization. The supervisor is not required to specify a set of classes, but only to provide a set of texts grouped by the criteria to be used to organize the collection. In this paper, we present a novel algorithm for clustering text documents which exploits the EM algorithm together with a feature selection technique based on information gain. The experimental results show that only very few documents are needed to initialize the clusters and that the algorithm is able to properly extract the regularities hidden in a huge unlabeled collection.
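The abstract names two ingredients, EM and information-gain feature selection, without giving their details. The following is only a minimal sketch of how such a loop could look, not the authors' implementation. It assumes a multinomial naive Bayes mixture, Laplace smoothing (`alpha`), seed documents clamped to their given cluster, and feature re-selection from the current hard assignment at every iteration. All function and parameter names (`seeded_em`, `information_gain`, `top_terms`, `n_features`) are hypothetical.

```python
import math

def entropy(counts):
    """Shannon entropy (bits) of a distribution given as non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(docs, labels, n_clusters):
    """Information gain of each term's presence/absence w.r.t. the labels."""
    class_counts = [labels.count(k) for k in range(n_clusters)]
    gains = []
    for t in range(len(docs[0])):
        present = [0] * n_clusters
        for doc, k in zip(docs, labels):
            if doc[t] > 0:
                present[k] += 1
        absent = [c - p for c, p in zip(class_counts, present)]
        p_t = sum(present) / len(docs)
        gains.append(entropy(class_counts)
                     - p_t * entropy(present)
                     - (1 - p_t) * entropy(absent))
    return gains

def top_terms(docs, labels, n_clusters, n_features):
    """Indices of the n_features terms with the highest information gain."""
    gains = information_gain(docs, labels, n_clusters)
    return sorted(range(len(gains)), key=lambda t: -gains[t])[:n_features]

def seeded_em(docs, seeds, n_clusters, n_features, n_iter=10, alpha=1.0):
    """docs: list of term-count lists; seeds: {doc index: cluster id}."""
    # Responsibilities: only the seed documents carry weight at the start.
    resp = [[0.0] * n_clusters for _ in docs]
    for i, k in seeds.items():
        resp[i][k] = 1.0
    seed_ids = sorted(seeds)
    # Initial feature selection can only use the few labelled seeds.
    selected = top_terms([docs[i] for i in seed_ids],
                         [seeds[i] for i in seed_ids], n_clusters, n_features)
    labels = []
    for _ in range(n_iter):
        # M-step: smoothed priors and per-cluster term distributions.
        mass = [sum(r[k] for r in resp) for k in range(n_clusters)]
        log_prior = [math.log((m + alpha) / (sum(mass) + alpha * n_clusters))
                     for m in mass]
        log_theta = []
        for k in range(n_clusters):
            counts = [alpha + sum(r[k] * d[t] for r, d in zip(resp, docs))
                      for t in selected]
            log_theta.append([math.log(c / sum(counts)) for c in counts])
        # E-step: posterior cluster probabilities over the selected terms.
        for i, d in enumerate(docs):
            if i in seeds:
                continue  # seeds stay clamped to their given cluster
            scores = [log_prior[k]
                      + sum(d[t] * lt for t, lt in zip(selected, log_theta[k]))
                      for k in range(n_clusters)]
            top = max(scores)
            weights = [math.exp(s - top) for s in scores]
            resp[i] = [w / sum(weights) for w in weights]
        labels = [max(range(n_clusters), key=lambda k: r[k]) for r in resp]
        # Re-select features using the current hard assignment.
        selected = top_terms(docs, labels, n_clusters, n_features)
    return labels, selected
```

A call such as `seeded_em(docs, seeds={0: 0, 3: 1}, n_clusters=2, n_features=4)` takes one seed document per cluster. The number of retained terms and the number of iterations are free parameters in this sketch, and the abstract does not say how the paper sets them.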

Rigutini, L., Maggini, M. (2005). A Semi-supervised Document Clustering Algorithm based on EM. In Proceedings of the IEEE/ACM/WI International Conference on Web Intelligence (WIC 2005) (pp.200-206). IEEE [10.1109/WI.2005.13].

2005-01-01


Year: 2005
ISBN: 076952415X
Appears in types: 4.1 Contribution in conference proceedings

Files in this record:

File	Size	Format
WI05a.pdf (not available; type: post-print; license: not public, private/restricted access)	210.01 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11365/37016

Warning: the data displayed have not been validated by the university.
