Geraci, F., Maggini, M. (2011). A Multi-sequence Alignment Algorithm for Web Template Detection. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (KDIR 2011) (pp. 121-128).

A Multi-sequence Alignment Algorithm for Web Template Detection

Geraci, Filippo; Maggini, Marco
2011-01-01

Abstract

Most Web pages today are automatically assembled by content management systems or editing tools that apply a fixed template, giving a uniform structure to all the documents belonging to the same site. The template usually contains side information, such as improved graphics, navigation bars and menus, banners, and advertisements, that is meant to improve the users' browsing experience but may hinder tools for the automatic processing of Web documents. In this paper, we present a novel template removal technique that exploits a sequence alignment algorithm from bioinformatics to automatically extract the template from a fairly small sample of pages from the same site. The algorithm detects the common structure of HTML tags among pairs of pages and merges the partial hypotheses using a binary tree consensus schema. The experimental results show that the algorithm attains good precision and recall in retrieving the real template structure using just 16 sample pages from the site. Moreover, the positive impact of the template removal technique is shown on a Web page clustering task.
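The pairwise detection and binary tree merging described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the alignment algorithm or its scoring scheme, so this sketch assumes a plain longest common subsequence as a stand-in for the alignment step, and the names `tag_sequence`, `common_subsequence`, and `template_consensus` are hypothetical.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records opening and closing tag names in document order, ignoring text."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tags.append(f"</{tag}>")


def tag_sequence(html):
    """Reduce a page to its sequence of HTML tags."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def common_subsequence(a, b):
    """Longest common subsequence of two tag sequences.

    Assumption: LCS is used here as a simple substitute for the
    bioinformatics alignment algorithm mentioned in the abstract.
    """
    rows, cols = len(a), len(b)
    # table[i][j] holds the LCS length of a[i:] and b[j:].
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the start to recover one LCS.
    result, i, j = [], 0, 0
    while i < rows and j < cols:
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return result


def template_consensus(pages):
    """Merge pairwise template hypotheses level by level, as in a binary tree."""
    level = [tag_sequence(page) for page in pages]
    if not level:
        return []
    while len(level) > 1:
        merged = [
            common_subsequence(level[k], level[k + 1])
            for k in range(0, len(level) - 1, 2)
        ]
        if len(level) % 2:
            # An unpaired hypothesis is carried up to the next level unchanged.
            merged.append(level[-1])
        level = merged
    return level[0]
```

Calling `template_consensus` on a list of HTML strings from the same site returns the tag sequence shared by all of them. With a sample size that is a power of two, such as the 16 pages mentioned in the abstract, every level of the tree pairs up evenly.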
ISBN: 9789898425799
Files in this item:
File: KDIR_2011_146_CR.pdf (not available)
Type: Post-print
License: NOT PUBLIC - Private/restricted access
Size: 280.91 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11365/36592