lucene-java-user mailing list archives

From Simon Willnauer <simon.willna...@gmail.com>
Subject Re: Concurrent indexing performance problem
Date Thu, 07 Mar 2013 19:50:06 GMT
On Thu, Mar 7, 2013 at 6:44 PM, Michael McCandless
<lucene@mikemccandless.com> wrote:
> This sounds reasonable (500 M docs / 50 GB index), though you'll need
> to test resulting search perf for what you want to do with it.
>
> To reduce merging time, maximize your IndexWriter RAM buffer
> (setRAMBufferSizeMB).  You could also increase the
> TieredMergePolicy.setSegmentsPerTier to allow more segments per level,
> but note that while this causes less merging, it might mean slower
> searching (since there are more segments to visit).
>
> If possible, use an SSD: merging is quite a bit faster, and you can
> increase the allowed max merge threads
> (ConcurrentMergeScheduler.setMaxThreadCount); if you can't use an SSD,
> then make sure maxThreadCount is 1.
>
> 4.x has concurrent flushing (see
> http://blog.mikemccandless.com/2011/05/265-indexing-speedup-with-lucenes.html),
> which is a big improvement in indexing rate ... but if merging is your
> bottleneck then faster flushing won't help overall.
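
The settings quoted above could be wired together roughly as follows. This is only a sketch, assuming the Lucene 3.6 API that Jan is on; the class name `TunedWriter`, the `onSsd` flag, the analyzer choice, and the numeric values (512 MB, 20 segments per tier) are placeholders for illustration, not recommendations from this thread.

```java
import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Hypothetical helper; names and values are assumptions for illustration.
class TunedWriter {

    static IndexWriter open(File indexDir, boolean onSsd) throws IOException {
        IndexWriterConfig cfg = new IndexWriterConfig(
                Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));

        // Larger RAM buffer: fewer, larger flushed segments, so less merging.
        cfg.setRAMBufferSizeMB(512.0);

        // More segments per tier: less merging, but more segments to search.
        TieredMergePolicy mergePolicy = new TieredMergePolicy();
        mergePolicy.setSegmentsPerTier(20.0);
        cfg.setMergePolicy(mergePolicy);

        // On spinning disks, limit merging to a single thread.
        // On an SSD the default is kept here; it can be raised if needed.
        ConcurrentMergeScheduler scheduler = new ConcurrentMergeScheduler();
        if (!onSsd) {
            scheduler.setMaxThreadCount(1);
        }
        cfg.setMergeScheduler(scheduler);

        Directory dir = FSDirectory.open(indexDir);
        return new IndexWriter(dir, cfg);
    }
}
```

Whether these values help depends on the data and hardware, so they are best treated as starting points for measurement.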

I am not sure, actually. The funny thing here is that with concurrent
flushing you create many more segments on disk, which means you need
to merge more segments, but it also means your merge policy (MP) can
make better decisions and merge smaller segments first. I am not
convinced that this would not help you. Especially if you keep the
background process merging, this could be a win overall.
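
To make the trade-off between segment count and merge work concrete, here is a toy model in plain Java. It is not Lucene's TieredMergePolicy: it assumes equal-sized flushes and merges exactly `segmentsPerTier` segments whenever that many accumulate on one level. The class `MergeCostModel` and its method are hypothetical names. It only counts how many documents merges rewrite, so it says nothing about the benefit of merging small segments concurrently in the background.

```java
import java.util.ArrayList;
import java.util.List;

// Toy level-based merge model (an assumption for illustration, not Lucene code).
class MergeCostModel {

    /**
     * Simulates `flushes` flushed segments of `docsPerFlush` docs each.
     * Returns {docs rewritten by merges, segments remaining at the end}.
     */
    static long[] simulate(int flushes, long docsPerFlush, int segmentsPerTier) {
        if (segmentsPerTier < 2) {
            throw new IllegalArgumentException("segmentsPerTier must be >= 2");
        }
        List<Integer> countPerLevel = new ArrayList<>(); // segments on each level
        long rewritten = 0;
        for (int i = 0; i < flushes; i++) {
            int level = 0;
            long size = docsPerFlush;
            while (true) {
                if (countPerLevel.size() <= level) {
                    countPerLevel.add(0);
                }
                int count = countPerLevel.get(level) + 1;
                if (count < segmentsPerTier) {
                    countPerLevel.set(level, count);
                    break;
                }
                // Level is full: merge its segments into one on the next level.
                countPerLevel.set(level, 0);
                size *= segmentsPerTier;
                rewritten += size;
                level++;
            }
        }
        long remaining = 0;
        for (int count : countPerLevel) {
            remaining += count;
        }
        return new long[] {rewritten, remaining};
    }

    public static void main(String[] args) {
        // Same 1000 docs, flushed as many small or few large segments.
        long[] small = simulate(100, 10, 10);
        long[] large = simulate(10, 100, 10);
        // Same 100 flushes, with more segments allowed per tier.
        long[] wide = simulate(100, 10, 20);
        System.out.println("small flushes: rewritten=" + small[0] + " segments=" + small[1]);
        System.out.println("large flushes: rewritten=" + large[0] + " segments=" + large[1]);
        System.out.println("wider tiers:   rewritten=" + wide[0] + " segments=" + wide[1]);
    }
}
```

In this model, 100 flushes of 10 docs rewrite 2000 docs, while 10 flushes of 100 docs rewrite 1000. Raising segments per tier from 10 to 20 for the small flushes also brings it down to 1000 rewritten docs, but leaves 5 segments instead of 1.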

simon
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Thu, Mar 7, 2013 at 11:44 AM, Jan Stette <jan.stette@gmail.com> wrote:
>> I'm seeing performance problems when indexing a certain set of data, and
>> I'm looking for pointers on how to improve the situation. I've read the
>> very helpful performance advice on the Wiki and I am continuing to run
>> experiments based on that, but I'd also like to ask for comments as to whether I'm
>> heading in the right direction.
>>
>> Basically, I'm indexing a collection of mostly very small documents, around
>> 500 million of them. I'm doing this indexing from scratch, starting with an
>> empty index. The resulting size of the index on disk is around 50 GB after
>> indexing. I'm doing the indexing using a number of concurrent indexing
>> threads, and using a single Lucene index. I'm on Lucene 3.6.1 currently,
>> running on Linux.
>>
>> I'm looking at this process in a profiler, and what I'm seeing is that
>> after a while, the indexing process ends up spending a lot of time in merge
>> threads called "Lucene Merge Thread #NNN". Such merges seem to take around
>> 50% of the overall time, during which all the indexing threads are locked
>> out. Having run for less than an hour, I'm seeing merge threads numbered up
>> to 270, so there have been frequent as well as long-running merges.
>>
>> Even when no merge is happening, there is a lot of contention between the
>> indexing worker threads (there are around 12 of these).
>>
>> My questions are:
>>
>> - Is what I'm trying to do reasonable, i.e. the number of documents/overall
>> size/single index?
>> - What can I do to reduce the amount of time spent merging segments?
>> - What can I do to improve concurrency of indexing?
>>
>> Any suggestions would be highly appreciated.
>>
>> Regards,
>> Jan
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


