This paper describes Armil, a meta-search engine that groups the web snippets returned by auxiliary search engines into disjoint labeled clusters. The cluster labels generated by Armil provide the user with a compact guide to assessing the relevance of each cluster to his/her information need. Striking the right balance between running time and cluster well-formedness was a key point in the design of our system. Both the clustering and the labeling tasks are performed on the fly by processing only the snippets provided by the auxiliary search engines, and they use no external sources of knowledge. Clustering is performed by means of a fast version of the furthest-point-first algorithm for metric k-center clustering. Cluster labeling is achieved by combining intra-cluster and inter-cluster term extraction based on a variant of the information gain measure. We have tested the clustering effectiveness of Armil against Vivisimo, the de facto industrial standard in web snippet clustering, using as a benchmark a comprehensive set of snippets obtained from the Open Directory Project hierarchy. According to two widely accepted "external" metrics of clustering quality, Armil achieves performance levels that are 10% better. We also report the results of a thorough user evaluation of both the clustering and the cluster labeling algorithms. On a standard desktop PC (AMD Athlon with a 1 GHz clock and 750 Mbytes of RAM), Armil clusters and labels up to 200 snippets in less than one second.
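The furthest-point-first idea mentioned in the abstract can be illustrated with a minimal sketch. This is the plain textbook greedy heuristic for metric k-center, not the "fast version" used in Armil, whose details are not given here. The function name `furthest_point_first`, the choice of the first point as the initial center, and the Euclidean distance on numeric tuples are all assumptions made for illustration; a snippet clustering system would use a distance defined on text representations instead.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Sequence[float]


def euclidean(a: Point, b: Point) -> float:
    # Assumed metric for this sketch; any function satisfying the
    # triangle inequality can be passed in instead.
    return math.dist(a, b)


def furthest_point_first(
    points: List[Point],
    k: int,
    dist: Callable[[Point, Point], float] = euclidean,
) -> Tuple[List[int], List[int]]:
    """Plain greedy furthest-point-first heuristic for metric k-center.

    Returns (centers, assignment): indices of the chosen centers, and for
    each point the position in `centers` of its closest center.
    """
    if not points or k < 1:
        raise ValueError("need at least one point and k >= 1")
    k = min(k, len(points))

    # Assumption: start from the first point; any starting point works.
    centers = [0]
    nearest = [dist(p, points[0]) for p in points]
    assignment = [0] * len(points)

    while len(centers) < k:
        # The next center is the point furthest from all current centers.
        far = max(range(len(points)), key=nearest.__getitem__)
        centers.append(far)
        new_cluster = len(centers) - 1
        # Move to the new center every point that is now closer to it.
        for i, p in enumerate(points):
            d = dist(p, points[far])
            if d < nearest[i]:
                nearest[i] = d
                assignment[i] = new_cluster
    return centers, assignment


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (10, 10), (10, 11), (30, 0)]
    print(furthest_point_first(pts, 3))
```

Each round costs one pass over the points, so this basic form needs on the order of n·k distance computations. Any speed-up of the kind the abstract alludes to would have to reduce that count, and is outside the scope of this sketch.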

Geraci, F., M., P., Maggini, M., F., S. (2006). Cluster Generation and Labeling for Web Snippets: a Fast, Accurate Hierarchical Solution. INTERNET MATHEMATICS, 3(4), 413-443 [10.1080/15427951.2006.10129133].

Cluster Generation and Labeling for Web Snippets: a Fast, Accurate Hierarchical Solution

GERACI, FILIPPO; MAGGINI, MARCO
2006-01-01


Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11365/29631