Mailing-List: contact general-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: general@lucene.apache.org
Received-SPF: neutral (asf.osuosl.org: local policy)
Date: Tue, 8 Aug 2006 14:32:10 -0700 (PDT)
From: Chris Hostetter <hossman_lucene@fucit.org>
To: general@lucene.apache.org
Subject: Re: Need advice for doing incremental Index updates
In-Reply-To: 
 <8963E18186202146AD4AF866A1B0CAA4890136@sheex366.corp.conet.local>
Message-ID: <Pine.LNX.4.58.0608081430390.3615@hal.rescomp.berkeley.edu>
References: <8963E18186202146AD4AF866A1B0CAA4890136@sheex366.corp.conet.local>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII


I would solve your problem external to the index ... every time you run
your incremental process, as you walk your directory tree of files (adding
the new ones, deleting/re-adding the modified ones), record every file and
save that list somewhere.  When you are all done, compare the list from
this run with the list from the last run -- any file in the old list and
not in the new list is a document to be deleted.
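
A rough sketch of the list comparison described above, in Python only to
keep it short (the same idea ports directly to Java). Everything named
here is an assumption made for illustration: the JSON manifest file, the
function names, and the use of modification times to spot changed files
are not part of Lucene.

```python
import json
from pathlib import Path


def scan_files(root):
    """Walk the tree under root and return {relative path: mtime}."""
    root = Path(root)
    return {
        str(p.relative_to(root)): p.stat().st_mtime
        for p in root.rglob("*")
        if p.is_file()
    }


def load_manifest(manifest_path):
    """Load the list saved by the previous run (empty on the first run)."""
    manifest_path = Path(manifest_path)
    if not manifest_path.exists():
        return {}
    return json.loads(manifest_path.read_text())


def save_manifest(manifest_path, files):
    """Save this run's list so the next run can compare against it."""
    Path(manifest_path).write_text(json.dumps(files, indent=2))


def diff_runs(previous, current):
    """Compare two runs; return (added, modified, removed) path lists."""
    added = sorted(current.keys() - previous.keys())
    removed = sorted(previous.keys() - current.keys())
    modified = sorted(
        path
        for path in current.keys() & previous.keys()
        if current[path] != previous[path]
    )
    return added, modified, removed
```

A run would call scan_files() on the source directory, diff the result
against load_manifest(), index the added paths, delete and re-add the
modified ones, delete the removed ones from the index using whatever
delete-by-path call the indexer already uses for modified files, and
finish with save_manifest(). Only the removed list needs index deletes,
so nothing has to be read back out of the index to find stale documents.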


: Date: Tue, 8 Aug 2006 15:48:16 +0200
: From: "Leimbach, Johannes" <JLeimbach@CONET.DE>
: Reply-To: general@lucene.apache.org
: To: general@lucene.apache.org
: Subject: Need advice for doing incremental Index updates
:
: Hello,
:
:
:
: I need some advice regarding incremental index updates.
:
:
:
: There are three cases I need to handle when iterating over the
: source files (files that need to be indexed):
:
: 1.	A file did not change since the last update
: 2.	A file did change since the last update
: 3.	A file was removed since the last update
:
:
:
: Case 1. is easy...
:
: Case 2. as well.. just remove the old file and add the new one
:
: Case 3. is bugging me..
:
:
:
: How can I find out if a file which is specified in the index does not
: exist anymore?
:
:
:
: The blunt solution would be to retrieve *all* file paths from the index,
: and check whether each one exists. If so - go on, if the file does not
: exist on disk, remove it from the index. The problem I have with this
: is that I am possibly pulling a lot of data from the Lucene index. I
: will also do a lot of local filesystem checks. Sloooow?!
:
:
:
: Another idea I had is about introducing an "index version" integer. This
: number will be unique for each start of the parsing process. So each
: time my indexer program is started a new "index version" is created. Now
: each file which exists in the index and gets processed will have the
: "index version" number stored as a document field.
:
: This way all newly added and modified documents will have an up to date
: "index version" flag after indexing is complete.
:
: Now, to remove all physically deleted files from the index, I would
: select all documents which have an old "index version" flag stored
: inside them. Every document with such an old number can be safely
: removed.
:
: The problem with this solution is that *every* document in the index will
: get updated: First the old index version field is removed, then the new
: field is added.
:
: On the plus side, removing deleted files will be very fast.
:
:
:
:
:
: What would you recommend for keeping an incremental update?
:
: I fear the first version will be utterly slow for small updates whereas
: the second version will be a lot faster - though adding stuff is slower
: because of the additional field update for every document.
:
:
:
: Thanks for your advice,
:
: Johannes :-)


-Hoss