Recognition of Common Areas in a Web Page Using Visual Information: a possible application in a page classification


Extracting and processing information from Web pages is an important task in many areas, such as constructing search engines, information retrieval, and data mining from the Web. A common approach in the extraction process is to represent a page as a "bag of words" and then to perform additional processing on this flat representation. In this paper we propose a new, hierarchical representation that includes browser screen coordinates for every HTML object in a page. Using this visual information, one can define heuristics for recognizing common page areas such as the header, left and right menus, footer, and center of a page. Initial experiments show that, using our heuristics, the defined objects are recognized properly in 73% of cases. Finally, we show that a Naive Bayes classifier that takes the proposed representation into account clearly outperforms the same classifier using only information about the content of documents.
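The idea of recognizing page areas from screen coordinates can be illustrated with a minimal sketch. This record does not give the paper's actual heuristics, so everything here is an assumption made for illustration: the `Box` type, the `classify_area` function, and the `band` and `side` thresholds are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Rendered bounding box of one HTML object, in browser screen pixels."""
    x: float
    y: float
    width: float
    height: float


def classify_area(box, page_width, page_height, band=0.2, side=0.3):
    """Assign a box to one page area using only its position.

    `band` and `side` are illustrative thresholds (fractions of the page
    height and width), not values reported by the authors.
    """
    top, bottom = box.y, box.y + box.height
    left, right = box.x, box.x + box.width

    # Objects lying entirely in the top or bottom band of the page.
    if bottom <= band * page_height:
        return "header"
    if top >= (1 - band) * page_height:
        return "footer"

    # Objects lying entirely in the left or right side strip.
    if right <= side * page_width:
        return "left_menu"
    if left >= (1 - side) * page_width:
        return "right_menu"

    # Everything else is treated as the center of the page.
    return "center"
```

For example, on a hypothetical 1000 x 2000 pixel page, `classify_area(Box(0, 0, 1000, 100), 1000, 2000)` falls in the top band and is labeled as a header. A real system would obtain the boxes from a rendering engine and would likely combine position with other cues.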

Kovacevic, M., Diligenti, M., Gori, M., Milutinovic, V. (2002). Recognition of Common Areas in a Web Page Using Visual Information: a possible application in a page classification. In Proceedings of the Second IEEE International Conference on Data Mining (ICDM'02) (pp.250-257). Los Alamitos : IEEE Computer Society [10.1109/ICDM.2002.1183910].


Kovacevic M.;Diligenti M.;Gori M.;Milutinovic V.

2002-01-01



Year: 2002
ISBN: 0769517544, 9780769517544
Publication type: 4.1 Contribution in conference proceedings


Use this identifier to cite or link to this document: https://hdl.handle.net/11365/33005
