lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Concurrent indexing performance problem
Date Thu, 07 Mar 2013 17:44:36 GMT
This sounds reasonable (500 M docs / 50 GB index), though you'll need
to test resulting search perf for what you want to do with it.

To reduce merging time, maximize your IndexWriter RAM buffer
(setRAMBufferSizeMB).  You could also increase
TieredMergePolicy.setSegmentsPerTier to allow more segments per level,
but note that while this causes less merging, it might mean slower
searching (since there are more segments to visit).
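
A minimal sketch of those two settings, assuming the Lucene 3.6 API
that the original poster is on. The class name, the 512 MB buffer and
the value 20 are illustrative assumptions, not figures from this
thread; measure both indexing and search speed before settling on
values.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.util.Version;

// Hypothetical helper class; only the Lucene calls come from the library.
public final class MergeFriendlyConfig {

    static IndexWriterConfig create(Analyzer analyzer) {
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);

        // A larger RAM buffer means fewer, larger flushed segments,
        // so there is less merging to do later. 512 MB is an assumed value.
        config.setRAMBufferSizeMB(512.0);

        // Allow more segments per tier before a merge is triggered.
        // Less merging, but searches have more segments to visit.
        TieredMergePolicy mergePolicy = new TieredMergePolicy();
        mergePolicy.setSegmentsPerTier(20.0); // assumed value
        config.setMergePolicy(mergePolicy);

        return config;
    }
}
```

The returned config would then be passed to the IndexWriter
constructor along with the index Directory.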

If possible, use an SSD: merging is quite a bit faster, and you can
increase the allowed max merge threads
(ConcurrentMergeScheduler.setMaxThreadCount); if you can't use an SSD,
then make sure maxThreadCount is 1.
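
A sketch of the merge-thread setting, again assuming the Lucene 3.6
API. The class name, the onSsd flag and the value 3 are illustrative
assumptions; only the value 1 for non-SSD storage comes from the
advice above.

```java
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriterConfig;

// Hypothetical helper class; only the Lucene calls come from the library.
public final class MergeThreads {

    static void apply(IndexWriterConfig config, boolean onSsd) {
        ConcurrentMergeScheduler scheduler = new ConcurrentMergeScheduler();

        // On spinning disks, concurrent merges compete for the disk head,
        // so keep a single merge thread. On an SSD more threads can help;
        // 3 is an assumed value, not a recommendation from this thread.
        scheduler.setMaxThreadCount(onSsd ? 3 : 1);

        config.setMergeScheduler(scheduler);
    }
}
```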

4.x has concurrent flushing (see
http://blog.mikemccandless.com/2011/05/265-indexing-speedup-with-lucenes.html),
which is a big improvement in indexing rate ... but if merging is your
bottleneck then faster flushing won't help overall.

Mike McCandless

http://blog.mikemccandless.com

On Thu, Mar 7, 2013 at 11:44 AM, Jan Stette <jan.stette@gmail.com> wrote:
> I'm seeing performance problems when indexing a certain set of data, and
> I'm looking for pointers on how to improve the situation. I've read the
> very helpful performance advice on the Wiki and I am continuing to run
> experiments based on that, but I'd also like comments on whether I'm
> heading in the right direction.
>
> Basically, I'm indexing a collection of mostly very small documents, around
> 500 million of them. I'm doing this indexing from scratch, starting with an
> empty index. The resulting size of the index on disk is around 50 GB after
> indexing. I'm doing the indexing using a number of concurrent indexing
> threads, and using a single Lucene index. I'm on Lucene 3.6.1 currently,
> running on Linux.
>
> I'm looking at this process in a profiler, and what I'm seeing is that
> after a while, the indexing process ends up spending a lot of time in merge
> threads called "Lucene Merge Thread #NNN". Such merges seem to take around
> 50% of the overall time, during which all the indexing threads are locked
> out. Having run for less than an hour, I'm seeing merge threads numbered up
> to 270, so there have been frequent as well as long-running merges.
>
> Even when no merge is happening, there is a lot of contention between the
> indexing worker threads (there are around 12 of these).
>
> My questions are:
>
> - Is what I'm trying to do reasonable, i.e. the number of documents/overall
> size/single index?
> - What can I do to reduce the amount of time spent merging segments?
> - What can I do to improve concurrency of indexing?
>
> Any suggestions would be highly appreciated.
>
> Regards,
> Jan


