Replicas and near-replicas of documents are very common on the Web. While replication can improve information accessibility for users, near-replicas can hinder the effectiveness of search engines. We propose a method to detect similar pages, in particular replicas and near-replicas, based on a pair of signatures. The first signature is obtained by a random projection of the bag-of-words representation of the page contents. The second signature, referred to as the Hyperlink Map, is computed by a recursive equation that exploits the connectivity among Web pages to encode the page context. The experimental results show that, on the given dataset, replicas and near-replicas can be detected with precision and recall of 93%.
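The first signature can be illustrated with a minimal sketch. It assumes whitespace tokenization, Gaussian random directions seeded by a hash of each term, and a signature length of 128; none of these choices is given in the abstract, and the function names (`bag_of_words`, `term_direction`, `content_signature`, `cosine`) are hypothetical. It shows only the general idea of projecting a bag-of-words vector onto a few random directions, not the authors' actual procedure.

```python
import hashlib
import math
import random
from collections import Counter


def bag_of_words(text):
    """Term -> count, using lowercased whitespace tokens (an assumption)."""
    return Counter(text.lower().split())


def term_direction(term, dim):
    """Pseudo-random Gaussian vector for a term, seeded by a hash of the term.

    Seeding by the term makes the implicit projection matrix reproducible
    without storing it: the same term always maps to the same column.
    """
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def content_signature(text, dim=128):
    """Random projection of the bag-of-words vector into `dim` dimensions."""
    signature = [0.0] * dim
    for term, count in bag_of_words(text).items():
        direction = term_direction(term, dim)
        for i in range(dim):
            signature[i] += count * direction[i]
    return signature


def cosine(a, b):
    """Cosine similarity between two signatures (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    page = "replicas of documents are very common on the web and affect search engines"
    near = "replicas of documents are very common on the web and affect search results"
    other = "random projections approximately preserve distances between high dimensional vectors"
    print("near-replica similarity:", cosine(content_signature(page), content_signature(near)))
    print("unrelated similarity:", cosine(content_signature(page), content_signature(other)))
```

Under these assumptions, pages with nearly identical word counts produce signatures with high cosine similarity, while unrelated pages tend to score near zero. A real detector would still need a decision threshold and the second, link-based signature.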
Di Iorio, E., Diligenti, M., Gori, M., Maggini, M., Pucci, A. (2003). Detecting near replicas on the Web by content and hyperlink analysis. In Proceedings of the 12th World Wide Web Conference (WWW2003) (pp. 249-255). New York: IEEE [10.1109/WI.2003.1241201].
Detecting near replicas on the Web by content and hyperlink analysis
Di Iorio E.;Diligenti M.;Gori M.;Maggini M.;Pucci A.
2003-01-01
| File | Type | License | Size | Format |
|---|---|---|---|---|
| www03a.pdf | Publisher's PDF | Not public (private/restricted access) | 107.92 kB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/11365/37959
Warning: the data displayed have not been validated by the university.
