lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jason Rutherglen <>
Subject Re: Realtime & distributed
Date Fri, 09 Oct 2009 02:56:53 GMT

Katta doesn't require HDFS which would be slow to search on,
though Katta can be used to copy indexes out of HDFS onto local
servers. The best bet is hardware that uses SSDs because merges
and update latency will greatly decrease and there won't be a
synchronous IO issue as there is with hard drives. Also, IO
caches get flushed as large merges occur, which means subsequent
queries may hit the HD and slow down. With SSDs this is much
less of an issue.

Today near realtime search (with or without SSDs) comes at a
price, that is reduced indexing speed due to continued in RAM
merging. People typically hack something together where indexes
are held in a RAMDir until being flushed to disk. The problem
with this is, merging in the background becomes really tricky
unless it's performed inside of IndexWriter (see LUCENE-1313 and
IW.getReader). There is the Zoie system which uses the RAMDir
solution, however it's implemented using a customized deleted
doc set based on a bloomfilter backed by an inefficient RB tree
which slows down queries. There's always a trade off when trying
to build an NRT system, currently.

Also, there isn't a clear way to replicate segments in realtime
so people usually end up analyzing documents on each replicated
node, which is redundant. A long term solution here could be a
distributed transaction log where encoded segments are stored
and replicated to N nodes.

Deletes can pile up in segments so the
BalancedSegmentMergePolicy could be used to remove those faster
than LogMergePolicy, however I haven't tested it, and it may be
trying to not do large segment merges altogether which IMO
is less than ideal because query performance soon degrades
(similar to an unoptimized index).

Hopefully in the future we can offer searching over
IndexWriter's RAM buffer where indexing and search speed would
be roughly what it is today. That combined with a way to insure
segments don't get flushed out of the IO cache during large
segment merges would mean really efficient NRT, even on systems
with HDs. In the interim, you'd need to play around and see what
works for your requirements.


On Thu, Oct 8, 2009 at 7:00 PM, Angel, Eric <> wrote:
> Does anyone have any recommendations?  I've looked at Katta, but it doesn't
> seem to support realtime searching.  It also uses hdfs, which I've heard can
> be slow.  I'm looking to serve 40gb of indexes and support about 1 million
> updates per day.
> Thx
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message