This paper addresses the problem of categorizing terms, or lexical entities, into a predefined set of semantic domains by exploiting the knowledge available online on the Web. The proposed system can be used for the automatic expansion of thesauri, limiting the human effort to the preparation of a small training set of tagged entities. Terms are classified by modeling the contexts in which terms from the same class usually appear. The Web serves as a large repository of such contexts, which are extracted by querying one or more search engines. In particular, the paper shows that the required knowledge can be obtained directly from the snippets returned by the search engines, without the overhead of downloading documents. Because the Web is continuously updated worldwide, this approach makes it possible to tackle open-domain term categorization while handling both the geographical and the temporal variability of term semantics. The performance of different text classifiers is compared, showing that accuracy is very good regardless of the specific model, which validates the idea of using term contexts extracted from search engine snippets. Moreover, the experimental results indicate that only very few training examples are needed to reach the best performance (over 90% for the F1 measure).
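The idea of classifying a term through the contexts in which it appears can be illustrated with a small sketch. It is not the authors' system: the abstract does not name the classifiers that were compared, so a multinomial Naive Bayes model with add-one smoothing is assumed here as one plausible text classifier. The snippets are hard-coded stand-ins for search engine results, and the class labels (`music`, `sport`) and all function and class names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text, term):
    """Lowercase, keep alphabetic tokens, and drop the term's own words
    so that only the surrounding context is modeled."""
    words = re.findall(r"[a-z]+", text.lower())
    skip = set(re.findall(r"[a-z]+", term.lower()))
    return [w for w in words if w not in skip]


def context_bag(term, snippets):
    """Merge all snippets retrieved for a term into one bag of context words."""
    bag = Counter()
    for snippet in snippets:
        bag.update(tokenize(snippet, term))
    return bag


class NaiveBayesTermClassifier:
    """Multinomial Naive Bayes over context words (an assumed model choice)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # class label -> word counts
        self.class_counts = Counter()            # class label -> number of training terms
        self.vocab = set()

    def fit(self, examples):
        for term, snippets, label in examples:
            bag = context_bag(term, snippets)
            self.word_counts[label].update(bag)
            self.class_counts[label] += 1
            self.vocab.update(bag)

    def predict(self, term, snippets):
        bag = context_bag(term, snippets)
        total_terms = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.class_counts:
            counts = self.word_counts[label]
            # Add-one smoothing keeps unseen context words from zeroing a class.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.class_counts[label] / total_terms)
            for word, n in bag.items():
                score += n * math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical tagged terms; in a real setting the snippets would come from search engine queries.
TRAIN = [
    ("guitar", ["The guitar is a string instrument played in a band",
                "learn chords and songs on acoustic guitar"], "music"),
    ("piano", ["piano sheet music for classical songs",
               "the piano is a keyboard instrument played at concerts"], "music"),
    ("soccer", ["soccer team wins the league match",
                "players scored a goal in the soccer game"], "sport"),
    ("tennis", ["tennis players compete in the tournament match",
                "the tennis game was won by the champion"], "sport"),
]

clf = NaiveBayesTermClassifier()
clf.fit(TRAIN)

violin_label = clf.predict("violin", ["the violin is a string instrument used in classical music",
                                      "violin concerts and songs"])
rugby_label = clf.predict("rugby", ["rugby team players won the match",
                                    "the rugby league game"])
```

Removing the term itself from its snippets reflects the abstract's emphasis on context: the classifier has never seen "violin" or "rugby", so any decision must rest on the words that surround them.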

Rigutini, L., Di Iorio, E., Ernandes, M., Maggini, M. (2006). Automatic Term Categorization by Extracting Knowledge from the Web. In Proceedings of the 17th European Conference on Artificial Intelligence (ECAI 2006) (pp. 531-535). IOS Press.

Automatic Term Categorization by Extracting Knowledge from the Web

Leonardo Rigutini, Ernesto Di Iorio, Marco Ernandes, Marco Maggini
2006-01-01

Year: 2006
ISBN: 1586036424

Use this identifier to cite or link to this document: https://hdl.handle.net/11365/36546