The presence of replicas or near-replicas of documents is very common on the Web. Whilst replication can improve information accessibility for the users, the presence of near-replicas can hinder the effectiveness of search engines. We propose a method to detect similar pages, in particular replicas and near-replicas, which is based on a pair of signatures. The first signature is obtained by a random projection of the bag-of-words representation of the page contents. The second signature, referred to as Hyperlink Map, is computed by a recursive equation which exploits the connectivity among the Web pages to encode the page context. The experimental results show that on the given dataset replicas and near-replicas can be detected with a precision and recall of 93%.
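To make the first signature concrete, the following is a minimal sketch of a random projection of a bag-of-words vector. It is not the authors' implementation: the signature length `DIM`, the +/-1 projection entries, the hash-seeded helper `word_vector`, the whitespace tokenisation, and the cosine comparison are all assumptions chosen for illustration. The Hyperlink Map signature is not sketched because the abstract does not state its recursive equation.

```python
import hashlib
import math
import random
from collections import Counter

DIM = 64  # signature length; an illustrative choice, not a value from the paper


def word_vector(word, dim=DIM):
    """Deterministic pseudo-random +/-1 vector for a word (hypothetical helper)."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [rng.choice((-1.0, 1.0)) for _ in range(dim)]


def content_signature(text, dim=DIM):
    """Project the bag-of-words counts of `text` onto `dim` random directions."""
    counts = Counter(text.lower().split())  # naive whitespace tokenisation
    signature = [0.0] * dim
    for word, count in counts.items():
        vec = word_vector(word, dim)
        for i in range(dim):
            signature[i] += count * vec[i]
    return signature


def cosine(a, b):
    """Cosine similarity between two signatures (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


page = "replicas of documents are very common on the web and hinder search engines"
near_replica = "replicas of documents are very common on the web and hinder search quality"
unrelated = "fresh pasta needs flour eggs salt water patience time practice skill heat"

sim_near = cosine(content_signature(page), content_signature(near_replica))
sim_far = cosine(content_signature(page), content_signature(unrelated))
print(f"near replica: {sim_near:.2f}, unrelated page: {sim_far:.2f}")
```

Because the signature depends only on word counts, reordering the words of a page leaves it unchanged, and pages that share most of their vocabulary end up with nearby signatures. A detector along these lines would still need a similarity threshold, which the abstract does not specify.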

Di Iorio, E., Diligenti, M., Gori, M., Maggini, M., Pucci, A. (2003). Detecting near replicas on the Web by content and hyperlink analysis. In Proceedings of the 12th World Wide Web Conference (WWW2003) (pp. 249-255). New York: IEEE [10.1109/WI.2003.1241201].

Detecting near replicas on the Web by content and hyperlink analysis

Di Iorio E.; Diligenti M.; Gori M.; Maggini M.; Pucci A.
2003-01-01

Year: 2003
ISBN: 0-7695-1932-6

Use this identifier to cite or link to this document: https://hdl.handle.net/11365/37959