lucene-general mailing list archives

From "John" <...@e5systems.com>
Subject Re: Need advice for doing incremental Index updates
Date Wed, 09 Aug 2006 00:58:31 GMT
Hi,
If I run the incremental process and walk my directory tree of files, does it
cost more time?
Because I must run a thread to do as you said, and it runs all the time.
Thanks,
john




----- Original Message ----- 
From: "Chris Hostetter" <hossman_lucene@fucit.org>
To: <general@lucene.apache.org>
Sent: Wednesday, August 09, 2006 5:32 AM
Subject: Re: Need advice for doing incremental Index updates


>
> i would solve your problem external to the index ... every time you run
> your incremental process, as you walk your directory tree of files (adding
> the new ones, deleting/re-adding the modified ones) record every file and
> save that somewhere.  when you are all done, compare the list from this
> run with the list from the last run -- any file in the old list and not in
> the new list is a document to be deleted.
>
>
> : Date: Tue, 8 Aug 2006 15:48:16 +0200
> : From: "Leimbach, Johannes" <JLeimbach@CONET.DE>
> : Reply-To: general@lucene.apache.org
> : To: general@lucene.apache.org
> : Subject: Need advice for doing incremental Index updates
> :
> : Hello,
> :
> :
> :
> : I need some advice regarding incremental index updates.
> :
> :
> :
> : There are three cases I need to handle when iterating over the
> : source files (files that need to be indexed):
> :
> : 1. A file did not change since the last update
> : 2. A file did change since the last update
> : 3. A file was removed since the last update
> :
> :
> :
> : Case 1. is easy...
> :
> : Case 2. as well.. just remove the old file and add the new one
> :
> : Case 3. is bugging me..
> :
> :
> :
> : How can I find out if a file which is specified in the index, does not
> : exist anymore?
> :
> :
> :
> : The blunt solution would be to retrieve *all* file paths from the index,
> : and check whether each one exists. If so - go on, if the file does not
> : exist on disk, remove it from the index. The problem I have with this
> : is, that I am possibly pulling a lot of data from the lucene index. I
> : will also do a lot of local filesystem checks. Sloooow?!
> :
> :
> :
> : Another idea I had is about introducing an "index version" integer. This
> : number will be unique for each start of the parsing process. So each
> : time my indexer program is started a new "index version" is created. Now
> : each file which exists in the index and gets processed will have the
> : "index version" number stored as a document field.
> :
> : This way all newly added and modified documents will have an up to date
> : "index version" flag after indexing is complete.
> :
> : Now, to remove all physically deleted files from the index, I would
> : select all documents which have an old "index version" flag stored
> : inside them. Every document with such an old number can be safely
> : removed.
> :
> : Problem with this solution is, that *every* document in the index will
> : get updated: First the old index version field is removed, then the new
> : field is added.
> :
> : On the plus side, removing deleted files will be very fast.
> :
> :
> :
> :
> :
> : What would you recommend for keeping an incremental update?
> :
> : I fear the first version will be utterly slow for small updates whereas
> : the second version will be a lot faster - though adding stuff is slower
> : because of the additional field update for every document.
> :
> :
> :
> : Thanks for your advice,
> :
> : Johannes :-)
> :
> :
> :
> :
> :
> :
>
>
>
> -Hoss 
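
The list comparison described above can be sketched in plain Java. This is only an illustration under stated assumptions, not code from the thread: the class name SnapshotDiff, the one-path-per-line snapshot file, and the two command-line arguments are all hypothetical. The file list is collected during a single directory walk (the same walk the incremental indexer already performs), so no extra thread is required. The actual index deletion is left as a printed message because the call depends on the Lucene version in use.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class SnapshotDiff {

    // Loads the path list saved by the previous run (hypothetical format:
    // one path per line). Returns an empty set on the very first run.
    public static Set<String> loadSnapshot(Path snapshotFile) throws IOException {
        if (!Files.exists(snapshotFile)) {
            return new TreeSet<>();
        }
        return new TreeSet<>(Files.readAllLines(snapshotFile));
    }

    // Records every regular file found under root during this run.
    public static Set<String> walk(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .map(Path::toString)
                        .collect(Collectors.toCollection(TreeSet::new));
        }
    }

    // Any path in the old list and not in the new list is a document
    // that should be deleted from the index.
    public static Set<String> removedSince(Set<String> previous, Set<String> current) {
        Set<String> removed = new TreeSet<>(previous);
        removed.removeAll(current);
        return removed;
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args[0]);          // directory tree to index
        Path snapshotFile = Paths.get(args[1]);  // keep this outside the tree

        Set<String> previous = loadSnapshot(snapshotFile);
        Set<String> current = walk(root);

        for (String path : removedSince(previous, current)) {
            // Placeholder: delete the document for this path from the index here.
            System.out.println("delete from index: " + path);
        }

        // Save this run's list so the next run can compare against it.
        Files.write(snapshotFile, current);
    }
}
```

The sketch assumes paths contain no line breaks and that the snapshot file lives outside the indexed tree. It would be run once per indexing pass, for example as `java SnapshotDiff /data/docs /var/tmp/last-run.txt` (both paths are made up).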


