Content-based search and organization of Web documents pose new issues in information retrieval. We propose a novel approach to the classification of HTML documents based on a structured representation of their contents, which are split into logical contexts (paragraphs, sections, anchors, etc.). The classification is performed using Hidden Tree-Markov Models (HTMMs), an extension of Hidden Markov Models for processing structured objects. We report promising experimental results showing that the structured representation improves classification accuracy in most cases.

Diligenti, M., Gori, M., Maggini, M., Scarselli, F. (2001). Classification of HTML documents by Hidden Tree-Markov Models. In Proceedings of the 6th International Conference on Document Analysis and Recognition (ICDAR 2001) (pp. 849-853) [10.1109/ICDAR.2001.953907].

Classification of HTML documents by Hidden Tree-Markov Models

Diligenti, Michelangelo; Gori, Marco; Maggini, Marco; Scarselli, Franco
2001-01-01

ISBN: 0769512631
Files in this record:
ICDAR01b.pdf (not available)
Type: Post-print
License: Not public (private/restricted access)
Size: 305.35 kB
Format: Adobe PDF


Use this identifier to cite or link to this document: https://hdl.handle.net/11365/35840