lucene-java-user mailing list archives

From Harald Kirsch <Harald.Kir...@raytion.com>
Subject Re: Directory flushing / commit / openIfChanged
Date Fri, 10 Aug 2012 10:10:12 GMT
Maybe I did something wrong, or maybe it really does not help, but pushing 
data into Lucene was no faster than before.

I would like to remove my project-specific baggage and rephrase my 
question by means of a simple example.

Suppose a Lucene document is used to count events of certain types. For 
each type of event I have one document. Whenever a new event arrives, I 
must read the respective document from the index, increment the count, 
delete the document from the index and write the new one into the index.
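
To make the per-event cycle concrete, here is a minimal sketch of it. It is 
only an illustration under loud assumptions: a plain HashMap stands in for 
the index (this is not Lucene code), and the names EventCounterSketch, 
onEvent, count, reads, deletes and adds are made up for this example. It 
just shows that every event costs one read and one add, plus one delete 
whenever the event type has been seen before.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in: a HashMap plays the role of the index,
// with one "document" (the count) per event type.
class EventCounterSketch {
    private final Map<String, Long> index = new HashMap<>();
    private int reads = 0;
    private int deletes = 0;
    private int adds = 0;

    // Applies one event using the read / increment / delete / add cycle.
    void onEvent(String eventType) {
        reads++;
        Long stored = index.get(eventType);           // read the document, if any
        long updated = (stored == null) ? 1L : stored + 1L; // increment the count
        if (stored != null) {
            index.remove(eventType);                  // delete the old document
            deletes++;
        }
        index.put(eventType, updated);                // write the new document
        adds++;
    }

    // Returns the current count for an event type, or 0 if never seen.
    long count(String eventType) {
        Long stored = index.get(eventType);
        return stored == null ? 0L : stored;
    }

    int reads()   { return reads; }
    int deletes() { return deletes; }
    int adds()    { return adds; }
}
```

Feeding it a stream where one event type dominates makes the cost visible: 
nearly every event triggers the full read, delete and add sequence for the 
same few documents.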

In addition, assume the events follow a typical Zipf distribution, i.e. a 
small number of event types occur very frequently, while other types of 
events may appear just once.

What is the most efficient sequence of Lucene operations for such a 
scenario?

Harald.

On 07.08.2012 15:39, Harald Kirsch wrote:
> Hello Simon,
>
> ok, I'll try this out. Just to be sure. I was after a way to update
> documents before they are even written to disk, but this seems not to be
> the Lucene way. From what you propose I understand that this approach
> tries to keep documents from being written up to the time they need to
> be actually changed.
>
> If I need to keep some kind of map anyway myself, I wonder if I will not
> just cache the documents themselves rather than just their sequence id.
> If they are "old" enough I migrate them into the index. For the sequence
> IDs I would need a retirement strategy too.
>
> It was exactly this additional caching that I hoped to avoid. :-(
>
> Harald.
>
>
>
> On 06.08.2012 13:55, Simon Willnauer wrote:
>> hey harald,
>>
>> On Mon, Aug 6, 2012 at 1:22 PM, Harald Kirsch
>> <Harald.Kirsch@raytion.com> wrote:
>>> Hi,
>>>
>>> in my application I have to write tons of small documents to the index,
>>> but with a twist. Many of the documents are actually aggregations of
>>> pieces of information that appear in a data stream, usually close
>>> together, but nevertheless merged with information for other documents.
>>>
>>> When information a1 for my document A arrives, I create my A-object,
>>> store it with index.addDocument() and forget about it. Later, when a2
>>> arrives, I fetch A from the index, delete it from the index, update it,
>>> and store its updated version. To fetch it from the index, I use a reader
>>> retrieved with IndexReader.openIfChanged(). So for one piece of
>>> information I have roughly the following sequence:
>>>
>>>    get searcher via IndexReader.openIfChanged()
>>>    find previously stored document, if any
>>>    if document already available {
>>>      update document object
>>>      index.deleteDocuments(new Term(IDFIELD, id))
>>>    } else {
>>>      create document object
>>>    }
>>>    index.addDocument()
>>>
>>>
>>> The overall speed is not too bad, but I wonder if more is possible. I
>>> changed RAMBufferSizeMB from the default 16 to 200 but saw no improvement
>>> in speed.
>>>
>>> I would think that keeping documents in RAM for some time, such that many
>>> updates happen in RAM rather than being written to disk, would improve
>>> the overall running time.
>>>
>>> Any hints how to configure and use Lucene to improve the speed without
>>> layering my own caching on top of it?
>>
>> What happens is this: if you re-open a reader from an IW (near-real-time),
>> you flush documents to disk each time you reopen the NRT reader. With
>> high update rates, that likely means you don't keep stuff in memory for
>> very long, so increasing the RAM buffer size won't help much.
>> What I would try to exploit is the fact that you only need to open a
>> new reader if the document (or its latest update) you are looking for
>> has not been flushed to disk yet, i.e. is not in the reader you already
>> have open. Lucene ships with some handy tools that help you implement
>> this. I'd likely use org.apache.lucene.search.NRTManager, which exposes
>> the methods of IW (update/add/delete) and returns a sequence ID that
>> you can later use to request an NRT reader. Let's say you have document
>> X indexed with sequence ID 15 and you now want to update it. You look up
>> the ID of doc X in a hashmap or something like that to get the last
>> changed sequence ID, then you ask the NRTManager to refresh the searcher
>> it holds right now with NRTManager#waitForGeneration(15). If the
>> generation is already refreshed, it returns immediately; otherwise
>> it waits until it is opened. Then you can just acquire a new
>> searcher and check the document.
>>
>> something like this:
>>
>> String id = doc.getId();
>> Long seqId = mapping.get(id);
>>
>> if (seqId != null) {
>>    nrtManager.waitForGeneration(seqId);
>> }
>>
>> IndexSearcher s = nrtManager.acquire();
>> try {
>>    IndexReader reader = s.getIndexReader();
>>    // do something
>> } finally {
>>    nrtManager.release(s);
>> }
>>
>> from time to time you can prune the mapping for sequence ids that are
>> already flushed.
>>
>> hope that helps
>>
>> simon
>>>
>>> Harald.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>
>

-- 
Harald Kirsch
Raytion GmbH
Kaiser-Friedrich-Ring 74
40547 Duesseldorf
Fon +49-211-550266-0
Fax +49-211-550266-19
http://www.raytion.com
