Date: Tue, 8 Aug 2006 14:32:10 -0700 (PDT)
From: Chris Hostetter
To: general@lucene.apache.org
Subject: Re: Need advice for doing incremental Index updates

i would solve your problem external to the index
... every time you run your incremental process, as you walk your directory tree of files (adding the new ones, deleting/re-adding the modified ones) record every file and save that list somewhere. when you are all done, compare the list from this run with the list from the last run -- any file in the old list and not in the new list is a document to be deleted.

: Date: Tue, 8 Aug 2006 15:48:16 +0200
: From: "Leimbach, Johannes"
: Reply-To: general@lucene.apache.org
: To: general@lucene.apache.org
: Subject: Need advice for doing incremental Index updates
:
: Hello,
:
: I need some advice regarding incremental index updates.
:
: There are three cases I need to handle when iterating over the
: source files (files that need to be indexed):
:
: 1. A file did not change since the last update
: 2. A file did change since the last update
: 3. A file was removed since the last update
:
: Case 1. is easy...
:
: Case 2. as well.. just remove the old file and add the new one
:
: Case 3. is bugging me..
:
: How can I find out if a file which is specified in the index does not
: exist anymore?
:
: The blunt solution would be to retrieve *all* file paths from the index,
: and check whether each one exists. If so - go on, if the file does not
: exist on disk, remove it from the index. The problem I have with this
: is that I am possibly pulling a lot of data from the Lucene index. I
: will also do a lot of local filesystem checks. Sloooow?!
:
: Another idea I had is about introducing an "index version" integer. This
: number will be unique for each start of the parsing process. So each
: time my indexer program is started a new "index version" is created. Now
: each file which exists in the index and gets processed will have the
: "index version" number stored as a document field.
:
: This way all newly added and modified documents will have an up to date
: "index version" flag after indexing is complete.
:
: Now, to remove all physically deleted files from the index, I would
: select all documents which have an old "index version" flag stored
: inside them. Every document with such an old number can be safely
: removed.
:
: Problem with this solution is that *every* document in the index will
: get updated: First the old index version field is removed, then the new
: field is added.
:
: On the plus side, removing deleted files will be very fast.
:
: What would you recommend for keeping an incremental update?
:
: I fear the first version will be utterly slow for small updates whereas
: the second version will be a lot faster - though adding stuff is slower
: because of the additional field update for every document.
:
: Thanks for your advice,
:
: Johannes :-)
:

-Hoss
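
The list-comparison approach suggested in the reply could be sketched roughly as follows. This is only an illustration in Python, not code from the thread: the function names (`scan_tree`, `plan_update`, `load_manifest`, `save_manifest`), the JSON manifest file, and the use of modification times to detect changed files are all assumptions made for the example.

```python
import json
import os


def scan_tree(root):
    """Walk root and record every file seen in this run as {relative_path: mtime}."""
    seen = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            seen[os.path.relpath(full, root)] = os.path.getmtime(full)
    return seen


def load_manifest(path):
    """Load the file list saved by the previous run (empty on the first run)."""
    if not os.path.exists(path):
        return {}
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)


def save_manifest(path, manifest):
    """Save this run's file list so the next run can compare against it."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(manifest, fh)


def plan_update(previous, current):
    """Compare the old and new lists; return (to_add, to_reindex, to_delete)."""
    # Case 2 (assumption: a changed mtime means the file was modified).
    to_reindex = sorted(p for p in current
                        if p in previous and current[p] != previous[p])
    # New files that were not present in the last run.
    to_add = sorted(p for p in current if p not in previous)
    # Case 3: in the old list but not in the new one, so delete from the index.
    to_delete = sorted(p for p in previous if p not in current)
    return to_add, to_reindex, to_delete
```

A run would call `load_manifest`, `scan_tree`, and `plan_update`, hand the three lists to whatever add and delete-by-path operations the indexer exposes, and then call `save_manifest` with the new list. Nothing has to be read back from the index to find the removed files.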