lucene-dev mailing list archives

From Christoph Goller <gol...@detego-software.de>
Subject Re: subclassing of IndexReader
Date Fri, 31 Oct 2003 13:10:49 GMT
Hi Doug,


>> *) I am just curious. What is IndexReader.undeleteAll needed for?
>
> In Nutch we have a rotating set of indexes.  For example, we might
> create a new index every day.  Our crawler guarantees that pages will
> be re-indexed every 30 days, so we can, e.g., every day merge (or
> search w/o merging) the most recent 30 indexes.  So far so good.  But
> many pages are clones of other pages: different urls with the same
> content.  So, each time we deploy a new set of indexes we need to
> first perform duplicate detection to make sure that, for each unique
> content, only a single url is present, that with the highest link
> analysis score.  I implement this by first calling undeleteAll(), then
> performing the global duplicate detection, deleting duplicates from
> their index.  Does this make sense?  Each day duplicate detection must
> be repeated when a new index is added, but first all of the previously
> detected duplicates must be cleared.
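
If I understand the cycle correctly, in code it would look roughly like
this against the current IndexReader API (just a sketch; the stored
fields "md5" and "linkScore" are placeholders I made up, not Nutch's
actual schema):

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;

public class DedupCycle {

  // Best-scoring document seen so far for one content key.
  private static class Best {
    int reader, doc;
    float score;
    Best(int reader, int doc, float score) {
      this.reader = reader; this.doc = doc; this.score = score;
    }
  }

  public static void dedup(String[] indexDirs) throws Exception {
    IndexReader[] readers = new IndexReader[indexDirs.length];
    for (int i = 0; i < readers.length; i++) {
      readers[i] = IndexReader.open(indexDirs[i]);
      readers[i].undeleteAll();  // clear the previous run's duplicate flags
    }

    // For each content key, keep only the highest-scoring url
    // across all indexes and delete the rest.
    Map<String, Best> best = new HashMap<String, Best>();
    for (int r = 0; r < readers.length; r++) {
      for (int d = 0; d < readers[r].maxDoc(); d++) {
        Document doc = readers[r].document(d);
        String key = doc.get("md5");                          // placeholder field
        float score = Float.parseFloat(doc.get("linkScore")); // placeholder field
        Best b = best.get(key);
        if (b == null) {
          best.put(key, new Best(r, d, score));
        } else if (score > b.score) {
          readers[b.reader].delete(b.doc);  // old winner loses: mark it deleted
          best.put(key, new Best(r, d, score));
        } else {
          readers[r].delete(d);             // new doc is the duplicate
        }
      }
    }

    for (int i = 0; i < readers.length; i++) {
      readers[i].close();  // close() commits the deletions
    }
  }
}

The undeleteAll() at the top would then be exactly what makes the global
pass repeatable each day: yesterday's duplicate flags are wiped before
the detection runs again over the enlarged set of indexes.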

That's quite interesting. I am currently involved in a small crawling
project. We only crawl a very limited number of news pages, some of them
several times per day. We found that there are often tiny changes on
these pages (spelling corrections, banner changes) which we would like
to ignore (classify as duplicates), while we want to recognize bigger
changes. For such a setting, MD5 keys are not very helpful. How do you
detect duplicates in Nutch?
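
One standard alternative to MD5 that tolerates small edits is word
shingling: compare the sets of overlapping word n-grams of two page
versions and treat them as duplicates when their Jaccard similarity
passes a threshold. A rough, self-contained sketch of what I mean (the
shingle size and threshold are arbitrary, and this is not necessarily
what Nutch does):

import java.util.HashSet;
import java.util.Set;

public class ShingleSimilarity {

  // Collect every run of `size` consecutive words as one shingle.
  static Set<String> shingles(String text, int size) {
    String[] words = text.toLowerCase().split("\\s+");
    Set<String> result = new HashSet<String>();
    for (int i = 0; i + size <= words.length; i++) {
      StringBuilder sb = new StringBuilder(words[i]);
      for (int j = 1; j < size; j++) {
        sb.append(' ').append(words[i + j]);
      }
      result.add(sb.toString());
    }
    return result;
  }

  // Jaccard similarity of the two shingle sets:
  // size(intersection) / size(union).
  static double similarity(String a, String b) {
    Set<String> sa = shingles(a, 3);
    Set<String> sb = shingles(b, 3);
    Set<String> union = new HashSet<String>(sa);
    union.addAll(sb);
    sa.retainAll(sb);  // sa is now the intersection
    return union.isEmpty() ? 1.0 : (double) sa.size() / union.size();
  }

  public static void main(String[] args) {
    String v1 = "breaking news from the exchange today as shares rallied"
        + " sharply while analysts pointed to strong earnings reports";
    String v2 = "breaking news from the exchange today as shares ralied"
        + " sharply while analysts pointed to strong earnings reports";
    // A one-word spelling fix only disturbs the few shingles containing
    // the changed word, so the score stays high; MD5 would simply
    // report the two versions as different documents.
    System.out.println(similarity(v1, v2));
  }
}

A pure spelling or banner change only touches the handful of shingles
around the edit, so the score stays close to 1, while a substantive
rewrite of the page drops it sharply.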

Christoph

