lucene-java-user mailing list archives

From Harald Kirsch <Harald.Kir...@raytion.com>
Subject Re: Directory flushing / commit / openIfChanged
Date Tue, 07 Aug 2012 13:39:42 GMT
Hello Simon,

OK, I'll try this out. Just to be sure I understand: I was after a way to
update documents before they are even written to disk, but this seems not
to be the Lucene way. From what you propose, I understand that this
approach tries to keep documents from being written until the moment
they actually need to be changed.

If I need to keep some kind of map myself anyway, I wonder whether I
should not just cache the documents themselves rather than just their
sequence IDs. Once they are "old" enough, I migrate them into the index.
For the sequence IDs I would need a retirement strategy too.

It was exactly this additional caching that I hoped to avoid. :-(
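
For illustration, the document cache with age-based retirement described above could look roughly like the following. This is only a minimal sketch under assumptions, not anything Lucene provides: the class name AgingDocumentCache, the methods merge() and retire(), and the indexSink callback (a stand-in for something like a wrapper around IndexWriter.addDocument()) are all hypothetical, and timestamps are passed in explicitly and assumed to be non-decreasing.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.function.BinaryOperator;

/** Hypothetical write-behind cache: documents stay in RAM while they may still change. */
class AgingDocumentCache<D> {
    private static final class Entry<D> {
        D doc;
        long lastTouched;
        Entry(D doc, long lastTouched) { this.doc = doc; this.lastTouched = lastTouched; }
    }

    // Access-ordered map: iteration starts at the least recently touched entry.
    private final LinkedHashMap<String, Entry<D>> entries = new LinkedHashMap<>(16, 0.75f, true);
    private final long maxAgeMillis;
    private final BiConsumer<String, D> indexSink; // assumption: writes the document to the index

    AgingDocumentCache(long maxAgeMillis, BiConsumer<String, D> indexSink) {
        this.maxAgeMillis = maxAgeMillis;
        this.indexSink = indexSink;
    }

    /** Stores a first piece for id, or combines a new piece with the cached document. */
    void merge(String id, D piece, BinaryOperator<D> combine, long now) {
        Entry<D> e = entries.get(id); // get() also moves the entry to the "youngest" end
        if (e == null) {
            entries.put(id, new Entry<>(piece, now));
        } else {
            e.doc = combine.apply(e.doc, piece);
            e.lastTouched = now;
        }
    }

    /** Hands every document untouched for at least maxAgeMillis to the sink; returns the count. */
    int retire(long now) {
        int retired = 0;
        Iterator<Map.Entry<String, Entry<D>>> it = entries.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Entry<D>> me = it.next();
            if (now - me.getValue().lastTouched < maxAgeMillis) {
                break; // all remaining entries were touched more recently
            }
            indexSink.accept(me.getKey(), me.getValue().doc);
            it.remove();
            retired++;
        }
        return retired;
    }

    int size() { return entries.size(); }
}
```

A piece that arrives after its document has been retired would still have to go through the fetch, delete and re-add sequence from my first mail, so the maximum age has to be chosen larger than the typical distance between related pieces in the stream.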

Harald.



On 06.08.2012 13:55, Simon Willnauer wrote:
> hey harald,
>
> On Mon, Aug 6, 2012 at 1:22 PM, Harald Kirsch <Harald.Kirsch@raytion.com> wrote:
>> Hi,
>>
>> in my application I have to write tons of small documents to the index, but
>> with a twist. Many of the documents are actually aggregations of pieces of
>> information that appear in a data stream, usually close together, but
>> nevertheless merged with information for other documents.
>>
>> When information a1 for my document A arrives, I create my A-object, store
>> it with index.addDocument() and forget about it. Later, when a2 arrives, I
>> fetch A from the index, delete it from the index, update it, and store its
>> updated version. To fetch it from the index, I use a reader retrieved with
>> IndexReader.openIfChanged(). So for one piece of information I have roughly
>> the following sequence:
>>
>>    get searcher via IndexReader.openIfChanged()
>>    find previously stored document, if any
>>    if document already available {
>>      update document object
>>      index.deleteDocument(new Term(IDFIELD, id))
>>    } else {
>>      create document object
>>    }
>>    index.addDocument()
>>
>>
>> The overall speed is not too bad, but I wonder if more is possible. I
>> changed RAMBufferSizeMB from the default 16 to 200 but saw no improvement in
>> speed.
>>
>> I would think that keeping documents in RAM for some time such that many
>> updates happen in RAM, rather than being written to disk, would improve the
>> overall running time.
>>
>> Any hints how to configure and use Lucene to improve the speed without
>> layering my own caching on top of it?
>
> What happens is that each time you re-open a reader from an IW (near
> real-time), you flush documents to disk. With high update rates that
> likely means you don't keep stuff in memory for very long, so
> increasing the RAM buffer size won't help much.
>
> What I would try to exploit is the fact that you only need to open a
> new reader if the document (or its latest update) you are looking for
> has not been flushed to disk yet, i.e. is not in the reader you already
> have open. Lucene ships with some handy tools that help you implement
> this. I'd likely use org.apache.lucene.search.NRTManager, which exposes
> the methods of IW (update/add/delete) and returns a sequence ID that
> you can later use to request an NRT reader.
>
> Let's say you have document X indexed with sequence ID 15 and you now
> want to update it. You look up the ID of doc X in a hashmap or
> something like this to get the last changed sequence ID. Then you ask
> the NRTManager to refresh the searcher it currently holds with
> NRTManager#waitForGeneration(15). If that generation is already
> refreshed, it returns immediately; otherwise it waits until it is
> opened. Then you can just acquire a new searcher and check the
> document.
>
> something like this:
>
> String id = doc.getId();
> Long seqId = mapping.get(id);
>
> if (seqId != null) {
>    nrtManager.waitForGeneration(seqId);
> }
>
> IndexSearcher s = nrtManager.acquire();
> try {
>    IndexReader reader = s.getIndexReader();
>    // do something
> } finally {
>    nrtManager.release(s);
> }
>
> from time to time you can prune the mapping for sequence ids that are
> already flushed.
>
> hope that helps
>
> simon
>>
>> Harald.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>

-- 
Harald Kirsch
Raytion GmbH
Kaiser-Friedrich-Ring 74
40547 Duesseldorf
Fon +49-211-550266-0
Fax +49-211-550266-19
http://www.raytion.com
