From Simon Willnauer <simon.willna...@gmail.com>
Subject Re: Concurrent indexing performance problem
Date Thu, 07 Mar 2013 19:47:59 GMT
On Thu, Mar 7, 2013 at 7:06 PM, Jan Stette <jan.stette@gmail.com> wrote:
> Thanks for your suggestions, Mike, I'll experiment with the RAM buffer size
> and segments-per-tier settings and see what that does.
>
> The time spent merging seems to be so great though, that I'm wondering if
> I'm actually better off doing the indexing single-threaded. Am I right in
> thinking that no merging happens if there's just a single thread writing to
> the index? Or is merging a process that happens independently of how the
> documents were written to the index?

No, that is not true: no matter how many threads you use, in Lucene 3.6
you will create the same number of segments. (Note this is not true
in >= 4.0.)
What would be interesting to see is what happens if you set
maxMergeSize so that you prevent large segments from being merged. You
also might want to use a SerialMergeScheduler to make sure you are not
merging too many segments at once.
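For example, something along these lines (a minimal sketch against the
Lucene 3.6 API; I'm reading "maxMergeSize" as the merge policy's size
cap, here LogByteSizeMergePolicy.setMaxMergeMB, and the 512 MB cap is
only illustrative):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.index.SerialMergeScheduler;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class CappedMergeIndexer {
  public static IndexWriter open(File indexDir) throws Exception {
    // Segments larger than this are never selected for merging
    // (512 MB is an illustrative value, not a recommendation).
    LogByteSizeMergePolicy mp = new LogByteSizeMergePolicy();
    mp.setMaxMergeMB(512.0);

    IndexWriterConfig iwc = new IndexWriterConfig(
        Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
    iwc.setMergePolicy(mp);
    // Run merges one at a time instead of in parallel background threads.
    iwc.setMergeScheduler(new SerialMergeScheduler());

    return new IndexWriter(FSDirectory.open(indexDir), iwc);
  }
}

With SerialMergeScheduler the merge runs on the indexing thread that
triggers it, so merges happen one at a time instead of several
background merge threads fighting over the disk.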
>
> I'm also wondering if I would be better off creating multiple indexes and
> either merging them in one go after each index has been fully populated, or
> alternatively do searches across multiple indexes. How would you expect
> such a solution to perform by comparison?

If you can do that against different hard disks, that will certainly
give you a boost, since this process is pretty IO-bound, I would guess.
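
Roughly like this, assuming one sub-index per disk (a sketch against
the 3.6 API; the wrapper class and method names below are made up for
illustration, the Lucene calls are the standard addIndexes/MultiReader
ones):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class CombineIndexes {
  // Option 1: fold several independently built indexes into one
  // after all of them have been fully populated.
  public static void mergeInto(File target, File... parts) throws Exception {
    IndexWriterConfig iwc = new IndexWriterConfig(
        Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
    IndexWriter writer = new IndexWriter(FSDirectory.open(target), iwc);
    Directory[] dirs = new Directory[parts.length];
    for (int i = 0; i < parts.length; i++) {
      dirs[i] = FSDirectory.open(parts[i]);
    }
    writer.addIndexes(dirs);   // copies the segments over, no re-analysis
    writer.close();
  }

  // Option 2: leave the indexes separate and search them as one.
  public static IndexSearcher openSearcher(File... parts) throws Exception {
    IndexReader[] readers = new IndexReader[parts.length];
    for (int i = 0; i < parts.length; i++) {
      readers[i] = IndexReader.open(FSDirectory.open(parts[i]));
    }
    return new IndexSearcher(new MultiReader(readers));
  }
}

addIndexes(Directory...) just copies the segment files over without
re-analyzing anything, so the merge-into-one step is mostly sequential
IO; the MultiReader route skips that copy entirely at the cost of
searching more segments.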

simon
>
> Best regards,
> Jan
>
>
>
>
>
> On 7 March 2013 17:44, Michael McCandless <lucene@mikemccandless.com> wrote:
>
>> This sounds reasonable (500 M docs / 50 GB index), though you'll need
>> to test resulting search perf for what you want to do with it.
>>
>> To reduce merging time, maximize your IndexWriter RAM buffer
>> (setRAMBufferSizeMB).  You could also increase the
>> TieredMergePolicy.setSegmentsPerTier to allow more segments per level,
>> but note that while this causes less merging, it might mean slower
>> searching (since there are more segments to visit).
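For those two settings, a minimal sketch against the 3.6 API (the
wrapper class and the concrete values are only illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.util.Version;

public class BufferAndTierConfig {
  public static IndexWriterConfig create() {
    // Allow more segments per level before a merge is triggered
    // (default is 10): less merging, but more segments to search.
    TieredMergePolicy tmp = new TieredMergePolicy();
    tmp.setSegmentsPerTier(20.0);

    IndexWriterConfig iwc = new IndexWriterConfig(
        Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
    // Default is 16 MB; raise it as far as the heap allows so more
    // documents are buffered before a new segment is flushed.
    iwc.setRAMBufferSizeMB(512.0);
    iwc.setMergePolicy(tmp);
    return iwc;
  }
}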
>>
>> If possible, use an SSD: merging is quite a bit faster, and you can
>> increase the allowed max merge threads
>> (ConcurrentMergeScheduler.setMaxThreadCount); if you can't use an SSD,
>> then make sure maxThreadCount is 1.
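A sketch of that scheduler choice (3.6 API; the exact counts are
illustrative, and maxMergeCount has to be at least maxThreadCount, so
it is set first):

import org.apache.lucene.index.ConcurrentMergeScheduler;

public class MergeThreads {
  // Returns a merge scheduler tuned for the underlying disk type.
  public static ConcurrentMergeScheduler forDisk(boolean ssd) {
    ConcurrentMergeScheduler cms = new ConcurrentMergeScheduler();
    if (ssd) {
      cms.setMaxMergeCount(5);   // allow a small backlog of queued merges
      cms.setMaxThreadCount(3);  // SSDs cope well with concurrent merges
    } else {
      cms.setMaxMergeCount(2);
      cms.setMaxThreadCount(1);  // spinning disk: one merge at a time
    }
    return cms;
  }
}

Pass the result to IndexWriterConfig.setMergeScheduler(...) before
opening the IndexWriter.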
>>
>> 4.x has concurrent flushing (see
>>
>> http://blog.mikemccandless.com/2011/05/265-indexing-speedup-with-lucenes.html
>> ),
>> which is a big improvement in indexing rate ... but if merging is your
>> bottleneck then faster flushing won't help overall.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Thu, Mar 7, 2013 at 11:44 AM, Jan Stette <jan.stette@gmail.com> wrote:
>> > I'm seeing performance problems when indexing a certain set of data, and
>> > I'm looking for pointers on how to improve the situation. I've read the
>> > very helpful performance advice on the Wiki and I am carrying on doing
>> > experiments based on that, but I'd also ask for comments as to whether
>> > I'm heading in the right direction.
>> >
>> > Basically, I'm indexing a collection of mostly very small documents,
>> > around 500 million of them. I'm doing this indexing from scratch,
>> > starting with an empty index. The resulting size of the index on disk
>> > is around 50 GB after indexing. I'm doing the indexing using a number
>> > of concurrent indexing threads, and using a single Lucene index. I'm
>> > on Lucene 3.6.1 currently, running on Linux.
>> >
>> > I'm looking at this process in a profiler, and what I'm seeing is that
>> > after a while, the indexing process ends up spending a lot of time in
>> > merge threads called "Lucene Merge Thread #NNN". Such merges seem to
>> > take around 50% of the overall time, during which all the indexing
>> > threads are locked out. Having run for less than an hour, I'm seeing
>> > merge threads numbered up to 270, so there have been frequent as well
>> > as long-running merges.
>> >
>> > Even when no merge is happening, there is a lot of contention between the
>> > indexing worker threads (there are around 12 of these).
>> >
>> > My questions are:
>> >
>> > - Is what I'm trying to do reasonable, i.e. the number of
>> >   documents/overall size/single index?
>> > - What can I do to reduce the amount of time spent merging segments?
>> > - What can I do to improve concurrency of indexing?
>> >
>> > Any suggestions would be highly appreciated.
>> >
>> > Regards,
>> > Jan
>>


