Web Page Classification Using Spatial Information

DILIGENTI, MICHELANGELO; GORI, MARCO; MAGGINI, MARCO
2002-01-01

Abstract

Extracting and processing information from web pages is an important task in many areas, such as building search engines, information retrieval, and Web data mining. A common approach in the extraction process is to represent a page as a “bag of words” and then perform additional processing on this flat representation. In this paper we propose a new, hierarchical representation that includes browser screen coordinates for every HTML object in a page. This spatial information allows the definition of heuristics for recognizing common page areas such as the header, left and right menus, footer, and center of a page. In a preliminary experiment, our heuristics correctly recognize objects in 73% of cases. Finally, we show that a Naive Bayes classifier that takes the proposed representation into account clearly outperforms the same classifier using only the content of the documents.
2002
Diligenti, M., Gori, M., M., K., Maggini, M., V., M. (2002). Web Page Classification Using Spatial Information. In Proceedings of the VIII Convegno AI*IA.
Use this identifier to cite or link to this document: https://hdl.handle.net/11365/36412