lucene-java-user mailing list archives

From Jake Mannix <>
Subject Re: Realtime & distributed
Date Fri, 09 Oct 2009 03:18:20 GMT

On Thu, Oct 8, 2009 at 7:56 PM, Jason Rutherglen <> wrote:

> Today near-realtime search (with or without SSDs) comes at a
> price, namely reduced indexing speed due to continual in-RAM
> merging. People typically hack something together where indexes
> are held in a RAMDir until being flushed to disk. The problem
> with this is that merging in the background becomes really tricky
> unless it's performed inside of IndexWriter (see LUCENE-1313 and
> IW.getReader). There is the Zoie system, which uses the RAMDir
> solution; however, it's implemented using a customized deleted
> doc set based on a Bloom filter backed by an inefficient
> red-black tree, which slows down queries. There's always a
> trade-off when trying to build an NRT system, currently.
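The delete-check scheme the quote describes can be sketched roughly as follows. This is an illustrative toy, not Zoie's actual code: the class name, the 3-hash 16-bit Bloom filter, and the use of a TreeSet (standing in for the red-black tree) are all assumptions made for the example. The point is the shape of the lookup: the Bloom filter gives a cheap "definitely not deleted" fast path, and only Bloom-positive doc UIDs pay for the exact tree lookup.

```java
import java.util.BitSet;
import java.util.TreeSet;

// Illustrative sketch of a Bloom-filter-fronted deleted-doc set
// (hypothetical names; not Zoie's implementation).
public class DeleteSet {
    private final BitSet bloom = new BitSet(1 << 16); // 16-bit Bloom filter
    private final TreeSet<Long> exact = new TreeSet<>(); // exact deleted UIDs

    void delete(long uid) {
        for (int h = 0; h < 3; h++) bloom.set(hash(uid, h));
        exact.add(uid);
    }

    // Fast path: if any hash bit is clear, the doc is definitely live.
    // Only Bloom-positive UIDs fall through to the tree lookup.
    boolean isDeleted(long uid) {
        for (int h = 0; h < 3; h++)
            if (!bloom.get(hash(uid, h))) return false;
        return exact.contains(uid);
    }

    private int hash(long uid, int seed) {
        long x = uid * 0x9E3779B97F4A7C15L + seed; // simple mix, illustrative
        x ^= x >>> 33;
        return (int) (x & 0xFFFF);
    }

    public static void main(String[] args) {
        DeleteSet ds = new DeleteSet();
        ds.delete(42L);
        System.out.println("42 deleted: " + ds.isDeleted(42L));
        System.out.println("7 deleted: " + ds.isDeleted(7L));
    }
}
```

Note that because the exact set backs every Bloom-positive answer, false positives in the filter cost an extra lookup but never a wrong result; the quoted complaint is that this per-query check is still work the plain Lucene deleted-docs BitVector doesn't do.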

  I'm not sure what numbers you are using to justify saying that Zoie
"slows down queries" - queries at LinkedIn using Zoie have a typical
median response time of 4-8ms at the searcher-node level (slower
at the broker due to a lot of custom work that happens before
queries are actually sent to the nodes), while sustaining rapid
indexing throughput, all with essentially zero delay between an indexing
event and index visibility (i.e. true real-time, not "near real-time",
unless indexing events are coming in *very* fast).

  You say there's a trade-off, but as you should remember from your
time at LinkedIn, we do distributed real-time faceted search while
maintaining extremely low latency and still indexing sometimes more
than a thousand new docs a minute per node (I should dredge up
some fresh numbers to verify exactly what that figure is these days).

> Deletes can pile up in segments, so the
> BalancedSegmentMergePolicy could be used to remove those faster
> than LogMergePolicy; however, I haven't tested it, and it may be
> trying not to do large segment merges altogether, which IMO
> is less than ideal because query performance soon degrades
> (similar to an unoptimized index).
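For readers unfamiliar with LogMergePolicy's behavior, here is a toy simulation of its tiered merging, under the assumption (true of the default policy, mergeFactor=10) that whenever ten segments of roughly the same size level accumulate, they are merged into one larger segment. This is a model for intuition, not Lucene's code; it shows why the default policy periodically performs the large merges that BalancedSegmentMergePolicy tries to avoid.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Toy model of LogMergePolicy-style tiered merging (not Lucene's code).
public class MergeSim {
    static final int MERGE_FACTOR = 10; // Lucene's default mergeFactor

    public static void main(String[] args) {
        List<Integer> segments = new ArrayList<>(); // entry = segment size in docs
        for (int i = 0; i < 100; i++) {
            segments.add(1);        // flush a tiny 1-doc segment
            maybeMerge(segments);   // cascade merges as levels fill up
        }
        System.out.println("segments after 100 flushes: " + segments);
    }

    // Whenever MERGE_FACTOR same-sized segments pile up, merge them into one.
    static void maybeMerge(List<Integer> segments) {
        boolean merged = true;
        while (merged) {
            merged = false;
            Map<Integer, Integer> counts = new HashMap<>();
            for (int s : segments) counts.merge(s, 1, Integer::sum);
            for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
                if (e.getValue() >= MERGE_FACTOR) {
                    int size = e.getKey();
                    int removed = 0;
                    Iterator<Integer> it = segments.iterator();
                    while (it.hasNext() && removed < MERGE_FACTOR) {
                        if (it.next() == size) { it.remove(); removed++; }
                    }
                    segments.add(size * MERGE_FACTOR); // one bigger segment
                    merged = true;
                    break;
                }
            }
        }
    }
}
```

After 100 one-doc flushes the cascade leaves a single 100-doc segment: every tenth flush merges ten 1-doc segments into a 10-doc segment, and the tenth such merge triggers the large 100-doc merge - exactly the kind of big, deferred merge the quote is discussing.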

Not optimizing all the way has, in our case, actually proven to be
*better* than the "optimal" case of a one-segment index, at least
under real-time indexing at a rapid update pace.

