lucene-java-user mailing list archives

From Jan Stette <jan.ste...@gmail.com>
Subject Re: Concurrent indexing performance problem
Date Thu, 07 Mar 2013 18:06:54 GMT
Thanks for your suggestions, Mike, I'll experiment with the RAM buffer size
and segments-per-tier settings and see what that does.
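Concretely, I'm planning to try something along these lines (just a sketch against the 3.6 API as I understand it; the actual values are placeholders I'll tune experimentally):

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexingConfigSketch {
    public static IndexWriter openWriter(File path) throws Exception {
        Directory dir = FSDirectory.open(path);

        IndexWriterConfig cfg = new IndexWriterConfig(
                Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));

        // Bigger RAM buffer -> larger flushed segments -> fewer merges.
        cfg.setRAMBufferSizeMB(256.0);

        // More segments per tier: less merging, at some cost to search speed.
        TieredMergePolicy tmp = new TieredMergePolicy();
        tmp.setSegmentsPerTier(20.0);
        cfg.setMergePolicy(tmp);

        // We're on spinning disks, so keep merge concurrency at 1
        // (would raise this if we move to SSDs).
        ConcurrentMergeScheduler cms = new ConcurrentMergeScheduler();
        cms.setMaxThreadCount(1);
        cfg.setMergeScheduler(cms);

        return new IndexWriter(dir, cfg);
    }
}
```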

The time spent merging seems so great, though, that I'm wondering whether I'd
actually be better off indexing single-threaded. Am I right in thinking that
no merging happens if only a single thread writes to the index? Or does
merging happen independently of how the documents were written to the index?
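Related to that: would it be sane to simply turn merging off during the bulk load and merge once at the end? Something like the following (a sketch; I'm assuming NoMergePolicy and forceMerge behave this way in 3.6, and `dir`/`analyzer` are set up elsewhere):

```java
// 1) Bulk load with background merging disabled entirely.
IndexWriterConfig loadCfg =
        new IndexWriterConfig(Version.LUCENE_36, analyzer);
loadCfg.setMergePolicy(NoMergePolicy.COMPOUND_FILES);

IndexWriter writer = new IndexWriter(dir, loadCfg);
// ... addDocument() calls from the worker threads ...
writer.close();

// 2) Reopen with a normal merge policy and merge down in one pass.
IndexWriterConfig mergeCfg =
        new IndexWriterConfig(Version.LUCENE_36, analyzer);
IndexWriter merger = new IndexWriter(dir, mergeCfg);
merger.forceMerge(20);  // e.g. merge down to at most 20 segments
merger.close();
```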

I'm also wondering if I would be better off creating multiple indexes and
either merging them in one go once each index has been fully populated, or
alternatively searching across the multiple indexes. How would you expect
such a solution to perform by comparison?
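For the first variant, what I have in mind is roughly this (a sketch; I believe IndexWriter.addIndexes(Directory...) is the right call in 3.6, with each `parts[i]` populated by its own per-thread IndexWriter):

```java
// Each worker thread builds its own index in its own directory...
Directory[] parts = new Directory[numThreads];
// ... each parts[i] populated by a dedicated IndexWriter ...

// ...then fold them into a single index at the end.
IndexWriter merged = new IndexWriter(mergedDir, cfg);
merged.addIndexes(parts);  // copies/merges the part indexes in
merged.close();
```

For the second variant, I assume a MultiReader over the per-index readers would let searches span all of them without ever merging.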

Best regards,
Jan





On 7 March 2013 17:44, Michael McCandless <lucene@mikemccandless.com> wrote:

> This sounds reasonable (500 M docs / 50 GB index), though you'll need
> to test resulting search perf for what you want to do with it.
>
> To reduce merging time, maximize your IndexWriter RAM buffer
> (setRAMBufferSizeMB).  You could also increase the
> TieredMergePolicy.setSegmentsPerTier to allow more segments per level,
> but note that while this causes less merging, it might mean slower
> searching (since there are more segments to visit).
>
> If possible, use an SSD: merging is quite a bit faster, and you can
> increase the allowed max merge threads
> (ConcurrentMergeScheduler.setMaxThreadCount); if you can't use an SSD,
> then make sure maxThreadCount is 1.
>
> 4.x has concurrent flushing (see
> http://blog.mikemccandless.com/2011/05/265-indexing-speedup-with-lucenes.html),
> which is a big improvement in indexing rate ... but if merging is your
> bottleneck then faster flushing won't help overall.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Thu, Mar 7, 2013 at 11:44 AM, Jan Stette <jan.stette@gmail.com> wrote:
> > I'm seeing performance problems when indexing a certain set of data, and
> > I'm looking for pointers on how to improve the situation. I've read the
> > very helpful performance advice on the wiki and I'm carrying on doing
> > experiments based on that, but I'd also like to ask for comments as to
> > whether I'm heading in the right direction.
> >
> > Basically, I'm indexing a collection of mostly very small documents,
> > around 500 million of them. I'm doing this indexing from scratch,
> > starting with an empty index. The resulting size of the index on disk is
> > around 50 GB after indexing. I'm doing the indexing using a number of
> > concurrent indexing threads, writing to a single Lucene index. I'm on
> > Lucene 3.6.1 currently, running on Linux.
> >
> > I'm looking at this process in a profiler, and what I'm seeing is that
> > after a while, the indexing process ends up spending a lot of time in
> > merge threads called "Lucene Merge Thread #NNN". Such merges seem to
> > take around 50% of the overall time, during which all the indexing
> > threads are locked out. Having run for less than an hour, I'm seeing
> > merge threads numbered up to 270, so there have been frequent as well as
> > long-running merges.
> >
> > Even when no merge is happening, there is a lot of contention between
> > the indexing worker threads (there are around 12 of these).
> >
> > My questions are:
> >
> > - Is what I'm trying to do reasonable, i.e. this number of documents and
> >   overall size in a single index?
> > - What can I do to reduce the amount of time spent merging segments?
> > - What can I do to improve the concurrency of indexing?
> >
> > Any suggestions would be highly appreciated.
> >
> > Regards,
> > Jan
>
