From: Christoph Goller
To: Lucene Developers List
Subject: Re: subclassing of IndexReader
Date: Fri, 31 Oct 2003 14:10:49 +0100
Message-ID: <3FA25F59.4020600@detego-software.de>
In-Reply-To: <1067341903.3f9e584f2e644@www.mailshell.com>

Hi Doug,

>> *) I am just curious. What is IndexReader.undeleteAll needed for?
>
> In Nutch we have a rotating set of indexes. For example, we might create a new index every day.
> Our crawler guarantees that pages will be re-indexed every 30 days, so we can, e.g., every day merge
> (or search w/o merging) the most recent 30 indexes. So far so good. But many pages are clones of
> other pages: different urls with the same content. So, each time we deploy a new set of indexes we
> need to first perform duplicate detection to make sure that, for each unique content, only a single
> url is present, that with the highest link analysis score. I implement this by first calling
> undeleteAll(), then perform the global duplicate detection, deleting duplicates from their index.
> Does this make sense? Each day duplicate detection must be repeated when a new index is added, but
> first all of the previously detected duplicates must be cleared.

That's quite interesting. I am currently involved in a small crawling project. We crawl only a very
limited number of news pages, some of them several times per day. We found that these pages often
show tiny changes (spelling corrections, banner changes) that we would like to ignore, i.e. classify
as duplicates, while still recognizing bigger changes. In such a setting MD5 keys are not very
helpful, since any change to the content, however small, produces a different key.

How do you detect duplicates in Nutch?

Christoph
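
To make the MD5 point concrete, here is a minimal sketch in Python of one common alternative: comparing sets of word shingles by Jaccard similarity. It is an illustration only and not how Nutch detects duplicates (the thread does not say). The function names, the shingle size of 3, and the threshold of 0.8 are all assumptions chosen for the example.

```python
import hashlib
import re


def md5_key(text: str) -> str:
    """Exact-match fingerprint: any change to the text changes the key."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def shingles(text: str, k: int = 3) -> set:
    """Return the set of all k-word windows over the lowercased words."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Text shorter than one window: treat it as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as equal."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(old: str, new: str,
                      threshold: float = 0.8, k: int = 3) -> bool:
    """Hypothetical helper: True if the two texts share enough shingles."""
    return jaccard(shingles(old, k), shingles(new, k)) >= threshold


if __name__ == "__main__":
    old = ("The council voted on Tuesday to approve the new budget for "
           "road repairs and public transport in the city centre")
    new = old.replace("centre", "center")  # a small spelling change
    print("same MD5 key:", md5_key(old) == md5_key(new))
    print("similarity:", jaccard(shingles(old), shingles(new)))
    print("near duplicate:", is_near_duplicate(old, new))
```

The shingle size and threshold would need tuning against real pages: one changed word touches up to k shingles, so on short texts even a single edit can push the similarity below a strict threshold. Banner text would probably have to be stripped before comparison rather than absorbed by the threshold.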