A Multi-sequence Alignment Algorithm for Web Template Detection

Geraci, Filippo; Maggini, Marco

Nowadays most Web pages are automatically assembled by content management systems or editing tools that apply a fixed template to give a uniform structure to all the documents belonging to the same site. The template usually contains side information that provides better graphics, navigation bars and menus, banners and advertisements that are aimed at improving the users’ browsing experience but may hinder tools for automatic processing of Web documents. In this paper, we present a novel template removal technique that exploits a sequence alignment algorithm from bioinformatics and is able to automatically extract the template from a quite small sample of pages from the same site. The algorithm detects the common structure of HTML tags among pairs of pages and merges the partial hypotheses using a binary tree consensus schema. The experimental results show that the algorithm is able to attain good precision and recall in the retrieval of the real template structure exploiting just 16 sample pages from the site. Moreover, the positive impact of the template removal technique is shown on a Web page clustering task.

Geraci, F., Maggini, M. (2011). A Multi-sequence Alignment Algorithm for Web Template Detection. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (KDIR 2011) (pp.121-128).
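
The abstract describes two steps: finding the HTML tag structure shared by pairs of pages, and merging those partial hypotheses along a binary tree. The Python sketch below illustrates that general idea only. It is not the authors' algorithm: a plain longest common subsequence stands in for the bioinformatics alignment used in the paper, and the names `tag_sequence`, `common_subsequence` and `template_consensus` are hypothetical, chosen here for illustration.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect the opening and closing tags of a page as a flat sequence."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append("<" + tag + ">")

    def handle_endtag(self, tag):
        self.tags.append("</" + tag + ">")


def tag_sequence(html):
    """Return the sequence of tag tokens of an HTML string (text is ignored)."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser.tags


def common_subsequence(a, b):
    """Longest common subsequence of two token lists.

    Assumption: this simple dynamic program replaces the alignment
    algorithm of the paper, whose details are not given in the abstract.
    """
    n, m = len(a), len(b)
    # table[i][j] = length of the LCS of a[i:] and b[j:]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table forward to recover one optimal subsequence.
    out, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


def template_consensus(sequences):
    """Merge pairwise hypotheses level by level, as in a binary tree."""
    level = [list(s) for s in sequences]
    if not level:
        raise ValueError("at least one tag sequence is required")
    while len(level) > 1:
        merged = [
            common_subsequence(level[k], level[k + 1])
            for k in range(0, len(level) - 1, 2)
        ]
        if len(level) % 2:  # an unpaired sequence moves up unchanged
            merged.append(level[-1])
        level = merged
    return level[0]
```

With 16 sample pages, as in the experiments mentioned in the abstract, this merging scheme would take four levels (16, 8, 4, 2, 1). A call such as `template_consensus([tag_sequence(p) for p in pages])`, where `pages` is a list of HTML strings from one site, returns the tag tokens common to all of them under the assumptions above.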


Year: 2011
ISBN: 9789898425799
Appears in types: 4.1 Contribution in conference proceedings

Files in this item:

File	Size	Format
KDIR_2011_146_CR.pdf (not available; type: Post-print; license: not public, private/restricted access)	280.91 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11365/36592
