lucene-java-user mailing list archives

From "Ivan Vasilev" <>
Subject Re: Lucene index update
Date Fri, 05 Jan 2007 09:25:53 GMT
Thanks a lot for your help Erick,

I wrote a simple class that uses your idea of extracting data that is
indexed but unstored by using the TermDocs/TermEnums classes. It works properly.
Yes, I also think that in some cases re-indexing the documents
will be faster. It depends on the documents to be updated: if, for
example, they are short and consist almost entirely of distinct words,
re-indexing will be faster, but if they are large and contain relatively
few distinct words, then the trick with TermDocs/TermEnums will be
faster. The trick also wins when the documents have to be gathered from a
network, as you mentioned.
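
For readers of the archive, here is a minimal sketch of the reconstruction
idea in plain Java. It does not use the Lucene API: the nested map below is a
hypothetical stand-in for what TermDocs/TermEnums and term positions expose,
and the class and method names are made up for illustration. It only shows
the principle of walking all terms, keeping those that occur in one document,
and ordering them by position.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy inverted index: term -> (docId -> positions of the term in that doc). */
class PostingsReconstructor {

    // Hypothetical stand-in for the postings a real index would expose.
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    /** Index a document by lowercasing and splitting on whitespace; the text itself is not kept. */
    void index(int docId, String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return; // nothing to index
        }
        String[] tokens = trimmed.toLowerCase().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            postings.computeIfAbsent(tokens[pos], t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    /** Walk every term, keep those occurring in docId, and order them by position. */
    String reconstruct(int docId) {
        TreeMap<Integer, String> byPosition = new TreeMap<>();
        for (Map.Entry<String, Map<Integer, List<Integer>>> e : postings.entrySet()) {
            List<Integer> positions = e.getValue().get(docId);
            if (positions == null) {
                continue; // term does not occur in this document
            }
            for (int pos : positions) {
                byPosition.put(pos, e.getKey());
            }
        }
        return String.join(" ", byPosition.values());
    }

    public static void main(String[] args) {
        PostingsReconstructor idx = new PostingsReconstructor();
        idx.index(1, "The quick brown fox");
        idx.index(2, "the lazy dog and the fox");
        System.out.println(idx.reconstruct(1));
        System.out.println(idx.reconstruct(2));
    }
}
```

Two things are visible even in this toy version. The result is lossy: only
the analyzed tokens come back (lowercased here), not the original text. And
reconstructing one document scans every term in the index, which is why the
number of distinct words matters for the speed comparison above.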

My task is not to update particular indexes but to create a tool that does it.
In this tool I can include both classes and let the user choose which
approach to use, with advice on which approach fits which case.

Thank you again for your help :).

By the way, could you give me a link where I can read about the Luke
program you mentioned, the one that reconstructs unstored fields?

Best regards,

----- Original Message ----- 
From: "Erick Erickson" <>
To: <>
Sent: Thursday, January 04, 2007 5:40 PM
Subject: Re: Lucene index update

> For approach <2>, I *think* you can extract information about unstored data
> by playing with TermDocs/TermEnums. Conceptually, the idea is to go through
> all the terms and, for document (lucene ID) 1, find the terms that appear in
> document 1 and order them by their term positions. Repeat for document 2,
> etc. WARNING: I haven't done this personally, so I may be way off base.
> Also, if the Luke source code is available, that program reconstructs
> unstored fields, so you might want to take a look at that. But this will
> inevitably be a lossy process.
> I have to ask, though, why you're rejecting option <4> so early. To modify a
> document in an index, you have to delete the document and re-add it. Unless
> the time to assemble the document to be re-indexed is prohibitive, I'd think
> about this option more seriously.
> Remember that not all documents in an index have to have the same fields.
> So, if you're only adding meta-data to *some* documents, you can freely just
> change those docs and ignore all the others. Your searcher does have to deal
> with null fields in some documents in this case.
> If you're going to re-index all documents, I'd question whether playing
> around with re-constructing them from the index is really a speed savings.
> Again, it will be if the time to assemble each document is large (say lots
> of DB queries or web crawling). But even in this case, is it the difference
> between 1 hour and 3 hours? Or 1 day and two weeks? If you can get it built
> in a night, I'd do it the simple way.
> How long did it take to create the index originally anyway?
> Best
> Erick
> On 1/4/07, Ivan Vasilev <> wrote:
>> Hi All,
>> I want to update some documents in existing indexes by adding a new field
>> to each of their documents. The documents contained in the indexes have some
>> fields that are indexed and NOT stored. The new field that will be added
>> will contain some metadata and will be stored and not indexed.
>> To fulfill the task I think I have 4 possible approaches:
>> 1. To use Lucene API to do this. According to my research I found that
>> Lucene does not have methods for updating documents directly.
>> 2. To extract the documents including their fields (including the most
>> important fields - those that are indexed but not stored). Then to add the
>> new field with the relevant value to these documents, delete them from the
>> index and finally add them to the index again (including their new field).
>>     This is the approach recommended in the book "Lucene in Action" and in
>> the FAQ page on Lucene's site - delete documents and add them again.
>>     The problem here is that according to Lucene's documentation "fields
>> which are not stored are not available in documents retrieved from the
>> index, e.g. with Hits.doc(int), Searcher.doc(int) or
>> IndexReader.document(int)". (The tests showed the same :) ). So for this
>> approach to work, the question is: how to extract the information in the
>> fields that are indexed but not stored?
>> 3. To create a tool that changes the data stored in the indexes. This
>> approach seems to be too "dangerous" but if anyone knows the format that
>> Lucene uses to keep these data please tell me.
>> 4. To reindex - that is, to delete the existing indexes and create them anew
>> with the necessary fields added. As the indexes could be very large, creating
>> them anew would be too slow a process, so this approach is the least
>> desirable. But I am afraid it is the only one that is practically
>> possible.
>> If someone could help me with approaches 1-3 or suggest some other
>> approach, please help me :)
>> Thanks in advance,
>> Ivan
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:
