A Self-Organising Map Approach for Clustering of XML Documents


The number of XML documents produced and available on the Internet is steadily increasing. It is thus important to devise automatic procedures that extract useful information from them with little or no intervention by a human operator. In this paper, we investigate the efficacy of an unsupervised learning approach, namely Self-Organising Maps (SOMs), for the automatic clustering of XML documents. Specifically, we consider a relatively large corpus of XML-formatted data from the INEX initiative and evaluate two different self-organising map models on it. The first is the classical SOM model, which requires the XML documents to be represented by real-valued vectors obtained using a "bag of words" (or, more precisely, a "bag of tags") approach. The second is the SOM for structured data (SOM-SD), which can cluster structured data directly: it can be fed tree-structured representations of the XML documents, thus explicitly preserving their structural information. The experimental results show that the classical SOM model performs poorly on this problem domain, which requires the ability to encode structural properties of the data. The SOM-SD model, on the other hand, produces good clustering and generalization performance. © 2006 IEEE.
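The difference between the two input representations named in the abstract can be illustrated with a minimal sketch. It assumes Python's standard `xml.etree.ElementTree` module; the function names `bag_of_tags` and `tag_tree`, the tag vocabulary, and the two sample documents are hypothetical and are not taken from the paper, and the nested-tuple tree is only a stand-in for whatever encoding SOM-SD actually consumes.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def bag_of_tags(xml_text, vocabulary):
    """Count how often each tag in `vocabulary` occurs, ignoring nesting."""
    counts = Counter(elem.tag for elem in ET.fromstring(xml_text).iter())
    return [float(counts[tag]) for tag in vocabulary]


def tag_tree(xml_text):
    """Return the document as a nested (tag, [children]) tuple, keeping structure."""
    def build(elem):
        return (elem.tag, [build(child) for child in elem])
    return build(ET.fromstring(xml_text))


# Two hypothetical documents with the same tags but different nesting.
doc_a = "<article><sec><p/></sec><sec><p/></sec></article>"
doc_b = "<article><sec><p/><p/></sec><sec/></article>"
vocabulary = ["article", "sec", "p"]  # assumed tag vocabulary

vec_a = bag_of_tags(doc_a, vocabulary)
vec_b = bag_of_tags(doc_b, vocabulary)

print(vec_a == vec_b)                        # flat vectors cannot tell the documents apart
print(tag_tree(doc_a) == tag_tree(doc_b))    # tree representations can
```

In this toy case the two documents collapse to the same real-valued vector, so any model fed only the flat vector cannot tell them apart, while the tree form keeps the nesting that distinguishes them.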

Trentini, F., Hagenbuchner, M., Sperduti, A., Scarselli, F., Tsoi, A.C. (2006). A Self-Organising Map Approach for Clustering of XML Documents. In Proceedings of the 2006 IEEE International Joint Conference on Neural Networks (pp.3471-3478). New York : Institute of Electrical and Electronics Engineers ( IEEE ) [10.1109/ijcnn.2006.246898].


Trentini, F.;Hagenbuchner, M.;Sperduti, A.;Scarselli, F.;Tsoi, A. C.

2006-01-01



Year: 2006
ISBN: 9780780394902
Appears in categories: 4.1 Contribution in conference proceedings

Files in this item:

File	Size	Format
ijcnn2006.pdf (publisher's PDF; not available, private/restricted access)	407.46 kB	Adobe PDF


Use this identifier to cite or link to this document: https://hdl.handle.net/11365/43068
