lucene-java-user mailing list archives

From Simon Willnauer <simon.willna...@gmail.com>
Subject Re: Directory flushing / commit / openIfChanged
Date Mon, 06 Aug 2012 11:55:53 GMT
hey harald,

On Mon, Aug 6, 2012 at 1:22 PM, Harald Kirsch <Harald.Kirsch@raytion.com> wrote:
> Hi,
>
> in my application I have to write tons of small documents to the index, but
> with a twist. Many of the documents are actually aggregations of pieces of
> information that appear in a data stream, usually close together, but
> nevertheless merged with information for other documents.
>
> When information a1 for my document A arrives, I create my A-object, store
> it with index.addDocument() and forget about it. Later, when a2 arrives, I
> fetch A from the index, delete it from the index, update it, and store its
> updated version. To fetch it from the index, I use a reader retrieved with
> IndexReader.openIfChanged(). So for one piece of information I have roughly
> the following sequence:
>
>   get searcher via IndexReader.openIfChanged()
>   find previously stored document, if any
>   if document already available {
>     update document object
>     index.deleteDocuments(new Term(IDFIELD, id))
>   } else {
>     create document object
>   }
>   index.addDocument()
>
>
> The overall speed is not too bad, but I wonder if more is possible. I
> changed RAMBufferSizeMB from the default 16 to 200 but saw no improvement in
> speed.
>
> I would think that keeping documents in RAM for some time such that many
> updates happen in RAM, rather than being written to disk would improve the
> overall running time.
>
> Any hints how to configure and use Lucene to improve the speed without
> layering my own caching on top of it?

What happens is that every time you reopen a reader from an IW (near
real-time), you flush the buffered documents to disk. With a high
update rate that likely means nothing stays in memory for very long,
so increasing the RAM buffer size won't help much.

What I would try to exploit is the fact that you only need to open a
new reader if the document you are looking for (or its latest update)
has not been flushed to disk yet, i.e. is not in the reader you
already have open. Lucene ships with some handy tools that help you
implement this. I'd likely use org.apache.lucene.search.NRTManager,
which exposes the methods of IW (update/add/delete) and returns a
sequence ID that you can later use to request an NRT reader.

Let's say you have document X indexed with sequence ID 15 and you now
want to update it. You look up the ID of doc X in a hash map or
something like it to get its last-changed sequence ID, then you ask
the NRTManager to refresh the searcher it currently holds with
NRTManager#waitForGeneration(15). If that generation has already been
refreshed, the call returns immediately; otherwise it waits until the
new searcher is opened. Then you can just acquire a searcher and check
the document.

something like this:

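// mapping: document id -> sequence ID (generation) of its last change
String id = doc.getId();
Long seqId = mapping.get(id);

if (seqId != null) {
  // block only if the last change is not yet visible to a searcher
  nrtManager.waitForGeneration(seqId);
}

IndexSearcher s = nrtManager.acquire();
try {
  IndexReader reader = s.getIndexReader();
  // do something, e.g. look up the stored document by id
} finally {
  // always hand the searcher back so its reader can be closed
  nrtManager.release(s);
}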
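
The snippet above only covers the read side. For the write side, a rough sketch of how the mapping could be filled, assuming the update call hands back the sequence ID as a long. UpdateTracker and GenerationSource are hypothetical names used for illustration, not Lucene types; GenerationSource merely stands in for whatever call performs the update.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class UpdateTracker {
    /** Hypothetical stand-in for the call that updates a document and returns its sequence ID. */
    public interface GenerationSource {
        long update(String id, String content);
    }

    // document id -> sequence ID of its most recent change
    private final ConcurrentMap<String, Long> mapping = new ConcurrentHashMap<>();
    private final GenerationSource writer;

    public UpdateTracker(GenerationSource writer) {
        this.writer = writer;
    }

    /** Performs the update and remembers the highest sequence ID seen for this id. */
    public long update(String id, String content) {
        long gen = writer.update(id, content);
        // Math::max keeps the newer value if two threads race on the same id
        mapping.merge(id, gen, Math::max);
        return gen;
    }

    /** Returns the last recorded sequence ID, or null if the id was never written. */
    public Long lastGeneration(String id) {
        return mapping.get(id);
    }
}
```

A null from lastGeneration(id) corresponds to the seqId == null case above: there is nothing to wait for.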

from time to time you can prune the mapping for sequence ids that are
already flushed.
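
A minimal sketch of that pruning step, assuming you keep track of the highest sequence ID you know to be visible already (for example the largest value for which waitForGeneration has returned). MappingPruner and visibleGen are hypothetical names, not part of Lucene.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentMap;

public class MappingPruner {
    /**
     * Removes every entry whose sequence ID is <= visibleGen, i.e. whose
     * change is assumed to be visible in the current searcher already.
     * Returns the number of entries removed.
     */
    public static int prune(ConcurrentMap<String, Long> mapping, long visibleGen) {
        int removed = 0;
        for (Map.Entry<String, Long> e : mapping.entrySet()) {
            Long gen = e.getValue();
            // conditional remove: skip the entry if a writer has stored a
            // newer sequence ID for this id in the meantime
            if (gen <= visibleGen && mapping.remove(e.getKey(), gen)) {
                removed++;
            }
        }
        return removed;
    }
}
```

Pruned ids simply fall into the "no sequence ID known" case on the next lookup, so no wait is issued for them.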

hope that helps

simon
>
> Harald.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
