Content-based search and organization of Web documents pose new issues in information retrieval. We propose a novel approach for the classification of HTML documents based on a structured representation of their contents, which are split into logical contexts (paragraphs, sections, anchors, etc.). The classification is performed using Hidden Tree-Markov Models (HTMMs), an extension of Hidden Markov Models for processing structured objects. We report promising experimental results showing that the structured representation improves classification accuracy in most cases.
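To make the idea of a structured representation concrete, the following is a minimal sketch of how an HTML page might be split into a tree of logical contexts. It is not the authors' method: the tag-to-context mapping `CONTEXT_TAGS`, the `Node` class, and the function `build_context_tree` are all hypothetical names introduced for illustration, and the sketch assumes reasonably well-formed HTML with explicit closing tags. It covers only the representation step and does not implement an HTMM.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Hypothetical mapping from HTML tags to logical context labels.
# The abstract does not specify which tags define which contexts.
CONTEXT_TAGS = {
    "title": "title",
    "h1": "heading",
    "h2": "heading",
    "h3": "heading",
    "p": "paragraph",
    "a": "anchor",
}


@dataclass
class Node:
    """One logical context: a label, its own words, and nested contexts."""
    label: str
    words: list = field(default_factory=list)
    children: list = field(default_factory=list)


class ContextTreeBuilder(HTMLParser):
    """Builds a tree of logical contexts while parsing an HTML string."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.stack = [self.root]  # path from the root to the open context

    def handle_starttag(self, tag, attrs):
        label = CONTEXT_TAGS.get(tag)
        if label is not None:
            node = Node(label)
            self.stack[-1].children.append(node)
            self.stack.append(node)

    def handle_endtag(self, tag):
        label = CONTEXT_TAGS.get(tag)
        # Close the context only if it matches the innermost open one.
        if label is not None and len(self.stack) > 1 and self.stack[-1].label == label:
            self.stack.pop()

    def handle_data(self, data):
        # Attach lowercased words to the innermost open context.
        self.stack[-1].words.extend(data.lower().split())


def build_context_tree(html):
    """Parse an HTML string and return the root of its context tree."""
    builder = ContextTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

A tree built this way keeps the words of an anchor separate from those of the enclosing paragraph, which is the kind of structural information a flat bag-of-words representation discards. A tree-structured model could then be trained on such trees, one model per class, under whatever assumptions the full paper specifies.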
Diligenti, M., Gori, M., Maggini, M., Scarselli, F. (2001). Classification of HTML documents by Hidden Tree-Markov Models. In Proceedings of the 6th International Conference on Document Analysis and Recognition (ICDAR 2001) (pp. 849-853). IEEE [10.1109/ICDAR.2001.953907].
Classification of HTML documents by Hidden Tree-Markov Models
Diligenti M.;Gori M.;Maggini M.;Scarselli F.
2001-01-01
https://hdl.handle.net/11365/35840