lucene-dev mailing list archives

From Doug Cutting <cutt...@lucene.com>
Subject Re: subclassing of IndexReader
Date Mon, 03 Nov 2003 17:02:05 GMT
Christoph Goller wrote:
> That's quite interesting. I am currently involved in a small crawling
> project. We only crawl a very limited number of news pages, some of them
> several times per day. We found that there are often tiny changes on
> these pages (spelling corrections, banner changes) which we would like
> to ignore (classify as duplicates), while we want to recognize bigger
> changes. For such a setting MD5 keys are not very helpful. How do you
> detect duplicates in Nutch?

Nutch currently only does MD5-based duplicate elimination.  So only 
exact duplicates are eliminated.
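
To make the limitation concrete, here is a minimal sketch of MD5-based duplicate elimination. It is not Nutch's actual code; the function names and the sample pages are hypothetical. It only illustrates why a single-character edit defeats a content hash.

```python
import hashlib


def content_key(text: str) -> str:
    """Return the MD5 hex digest of the page text (hypothetical helper)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def drop_exact_duplicates(pages):
    """Keep the first page seen for each distinct MD5 key.

    `pages` is an iterable of (url, text) pairs; this layout is an
    assumption made for the example.
    """
    seen = set()
    unique = []
    for url, text in pages:
        key = content_key(text)
        if key in seen:
            continue  # byte-for-byte identical content already kept
        seen.add(key)
        unique.append((url, text))
    return unique


# Made-up sample data: "b" is an exact copy of "a", "c" differs by one character.
pages = [
    ("a", "Breaking news: markets rally."),
    ("b", "Breaking news: markets rally."),
    ("c", "Breaking news: markets rally!"),
]
unique = drop_exact_duplicates(pages)
```

Page "b" is dropped, but page "c" survives because its digest is entirely different, even though a reader would call it the same page.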

There's been a fair amount of work on better methods.  For example, 
there was Broder et al.'s "Syntactic Clustering" work 
(http://gatekeeper.research.compaq.com/pub/DEC/SRC/technical-notes/SRC-1997-015-html/).

However, I've never seen anyone demonstrate how such methods can be 
applied efficiently to huge collections.  Perhaps they can, but it's not 
obvious to me.  I've also not followed this literature closely.
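
For readers unfamiliar with the shingling idea behind the "Syntactic Clustering" work, the following is a simplified sketch of its resemblance measure: the overlap between the sets of word w-grams ("shingles") of two documents. The shingle width, the tokenization, and the sample texts are assumptions chosen for illustration, and the sketch omits the sampling techniques the paper uses to shrink each document's shingle set.

```python
def shingles(text: str, w: int = 4) -> set:
    """Return the set of contiguous w-word sequences in `text`.

    Lowercasing and whitespace splitting are simplifying assumptions.
    """
    words = text.lower().split()
    if len(words) < w:
        # Short documents become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a: str, b: str, w: int = 4) -> float:
    """Jaccard overlap of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty documents are treated as identical
    return len(sa & sb) / len(sa | sb)


# Made-up sample texts: `edited` changes a single word of `original`.
original = "the quick brown fox jumps over the lazy dog near the river bank today"
edited = "the quick brown fox jumps over the sleepy dog near the river bank today"
unrelated = "stock prices fell sharply after the central bank raised interest rates"

score_same = resemblance(original, original)
score_edit = resemblance(original, edited)
score_other = resemblance(original, unrelated)
```

Unlike an MD5 key, the score degrades gradually with the size of the change, so a threshold (to be chosen empirically) can separate minor edits from real revisions. Note that comparing every pair of documents this way is quadratic in the collection size, which is exactly the scaling concern raised above.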

Doug



