Extracting and processing information from web pages is an important task in many areas, such as building search engines, information retrieval, and data mining from the Web. A common approach in the extraction process is to represent a page as a "bag of words" and then to perform additional processing on this flat representation. In this paper we propose a new, hierarchical representation that includes the browser screen coordinates of every HTML object in a page. Using this spatial information, one can define heuristics for recognizing common page areas such as the header, left and right menus, footer, and center of a page. Initial experiments show that, with our heuristics, the defined objects are recognized correctly in 73% of cases.
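
To make the idea of coordinate-based heuristics concrete, the following is a minimal sketch, not the authors' actual rules. It assumes each HTML object already has a rendered bounding box in page pixels, and it assigns the object to an area according to where the center of that box falls. The names `Box` and `classify_area`, and the `band` threshold of 20%, are hypothetical and chosen only for illustration; the abstract does not state the thresholds used in the paper.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Rendered bounding box of one HTML object, in page pixels."""
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


def classify_area(box: Box, page_width: float, page_height: float,
                  band: float = 0.2) -> str:
    """Assign a box to a page area by where its center falls.

    `band` is a hypothetical fraction of the page treated as a margin
    on each side; it is an assumption, not a value from the paper.
    """
    cx = box.x + box.width / 2
    cy = box.y + box.height / 2
    # Vertical bands are checked first, so a full-width strip at the
    # top or bottom is never mistaken for a side menu.
    if cy < band * page_height:
        return "header"
    if cy > (1 - band) * page_height:
        return "footer"
    if cx < band * page_width:
        return "left menu"
    if cx > (1 - band) * page_width:
        return "right menu"
    return "center"
```

For a page of 1000 by 2000 pixels, a narrow column hugging the left edge in the middle of the page would be labeled `"left menu"`, while a wide block in the middle would be labeled `"center"`. A real system would need rules tuned on actual pages, which is what the reported 73% figure evaluates.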

Kovačević, M., Diligenti, M., Gori, M., Milutinović, V. (2002). Recognition of Common Areas in a Web Page Using Visualization Approach. In Proceedings of the Tenth International Conference on Artificial Intelligence: Methodology, Systems, Applications (AIMSA) (pp.203-212). Springer [10.1007/3-540-46148-5_21].

Recognition of Common Areas in a Web Page Using Visualization Approach

Diligenti, Michelangelo; Gori, Marco
2002-01-01

ISBN: 3540441271
Use this identifier to cite or link to this document: https://hdl.handle.net/11365/43064