Reply-To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Message-ID: <3FA25F59.4020600@detego-software.de>
Date: Fri, 31 Oct 2003 14:10:49 +0100
From: Christoph Goller <goller@detego-software.de>
To: Lucene Developers List <lucene-dev@jakarta.apache.org>
Subject: Re: subclassing of IndexReader
References: <1067341903.3f9e584f2e644@www.mailshell.com>
In-Reply-To: <1067341903.3f9e584f2e644@www.mailshell.com>

Hi Doug,

>> *) I am just curious. What is IndexReader.undeleteAll needed for?
>
> In Nutch we have a rotating set of indexes.  For example, we might create a new index every day.  
> Our crawler guarantees that pages will be re-indexed every 30 days, so we can, e.g., every day merge 
> (or search w/o merging) the most recent 30 indexes.  So far so good.  But many pages are clones of 
> other pages: different urls with the same content.  So, each time we deploy a new set of indexes we 
> need to first perform duplicate detection to make sure that, for each unique content, only a single 
> url is present, that with the highest link analysis score.  I implement this by first calling 
> undeleteAll(), then perform the global duplicate detection, deleting duplicates from their index.  
> Does this make sense?  Each day duplicate detection must be repeated when a new index is added, but 
> first all of the previously detected duplicates must be cleared.
> 
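
To check that I understand the procedure, here is a minimal sketch of it. It does not use Lucene
at all: ToyIndex is a hypothetical in-memory stand-in that only mimics deletion marks, and Page,
contentHash and score are names I made up for illustration. It is certainly not the Nutch code.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DedupSketch {
    /** Hypothetical page record: URL, hash of the content, link-analysis score. */
    static final class Page {
        final String url;
        final String contentHash;
        final double score;
        Page(String url, String contentHash, double score) {
            this.url = url;
            this.contentHash = contentHash;
            this.score = score;
        }
    }

    /** Toy stand-in for one daily index; it only tracks deletion marks. */
    static final class ToyIndex {
        final List<Page> pages = new ArrayList<>();
        final BitSet deleted = new BitSet();
        void add(Page p) { pages.add(p); }
        void undeleteAll() { deleted.clear(); }
        void delete(int doc) { deleted.set(doc); }
        boolean isDeleted(int doc) { return deleted.get(doc); }
        int liveCount() { return pages.size() - deleted.cardinality(); }
    }

    /** Clears all old marks, then keeps only the best-scoring page per content hash. */
    static void dedup(List<ToyIndex> indexes) {
        for (ToyIndex idx : indexes) {
            idx.undeleteAll(); // forget duplicates detected in earlier runs
        }
        Map<String, int[]> best = new HashMap<>(); // hash -> {index number, doc number}
        for (int i = 0; i < indexes.size(); i++) {
            ToyIndex idx = indexes.get(i);
            for (int d = 0; d < idx.pages.size(); d++) {
                Page p = idx.pages.get(d);
                int[] cur = best.get(p.contentHash);
                if (cur == null) {
                    best.put(p.contentHash, new int[] {i, d});
                    continue;
                }
                Page champion = indexes.get(cur[0]).pages.get(cur[1]);
                if (p.score > champion.score) {
                    indexes.get(cur[0]).delete(cur[1]); // old champion loses
                    best.put(p.contentHash, new int[] {i, d});
                } else {
                    idx.delete(d); // on a tie the page seen first is kept
                }
            }
        }
    }

    public static void main(String[] args) {
        ToyIndex day1 = new ToyIndex();
        day1.add(new Page("http://a.example/x", "h1", 0.9));
        ToyIndex day2 = new ToyIndex();
        day2.add(new Page("http://b.example/x", "h1", 0.4)); // clone of the page above
        List<ToyIndex> both = new ArrayList<>();
        both.add(day1);
        both.add(day2);
        dedup(both);
        System.out.println("day2 live docs with day1 present: " + day2.liveCount());
        dedup(Collections.singletonList(day2)); // day1 has rotated out of the set
        System.out.println("day2 live docs after rotation: " + day2.liveCount());
    }
}
```

If this is right, the reset matters most when an index rotates out of the set: a page that was
deleted as a duplicate may then be the only remaining copy and has to become visible again.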

That's quite interesting. I am currently involved in a small crawling project. We crawl only a very
limited number of news pages, some of them several times per day. We found that these pages often
undergo tiny changes (spelling corrections, banner changes) that we would like to ignore, i.e.
classify the new version as a duplicate, while still recognizing bigger changes. In such a setting
MD5 keys are not very helpful, because changing a single character yields a completely different
key. How do you detect duplicates in Nutch?
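
To illustrate what I mean, here is a rough sketch. The MD5 part uses the standard
java.security.MessageDigest class. The rest is only one possible approach that I am assuming for
the sake of the example (word shingles compared by Jaccard similarity); I am not claiming that
Nutch works this way. The shingle size and the threshold are arbitrary values.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class NearDuplicateSketch {
    /** Exact-match key: any change to the text changes the digest. */
    static byte[] md5(String text) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Set of all runs of k consecutive words (lower-cased). */
    static Set<String> shingles(String text, int k) {
        String[] words = text.toLowerCase().trim().split("\\s+");
        Set<String> result = new HashSet<>();
        for (int i = 0; i + k <= words.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(words, i, i + k)));
        }
        return result;
    }

    /** Jaccard similarity: size of the intersection divided by size of the union. */
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // texts shorter than k words; treated as equal in this sketch
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    /** True if the two texts share at least the given fraction of their shingles. */
    static boolean nearDuplicate(String a, String b, int k, double threshold) {
        return jaccard(shingles(a, k), shingles(b, k)) >= threshold;
    }

    public static void main(String[] args) {
        String original = "the quick brown fox jumps over the lazy dog near the quiet river bank today";
        String corrected = "the quik brown fox jumps over the lazy dog near the quiet river bank today";
        System.out.println("same MD5: " + Arrays.equals(md5(original), md5(corrected)));
        System.out.println("similarity: " + jaccard(shingles(original, 2), shingles(corrected, 2)));
    }
}
```

With one misspelled word the two MD5 keys differ, while most of the shingles are still shared, so
a threshold below the measured similarity would classify the second version as a duplicate.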

Christoph

