lucene-java-user mailing list archives

From Igor Shalyminov <ishalymi...@yandex-team.ru>
Subject Re: How to use concurrency efficiently
Date Tue, 02 Apr 2013 14:39:32 GMT
Yes, the number of documents is not large (about 90 000), but the queries are very heavy.
Although they're just boolean, a typical query can produce tens of millions of hits.
Run in a single thread, such a query takes ~20 seconds, which is too slow. Therefore,
multithreading is vital for this task.

As you mentioned, merges are the source of non-uniform segment sizes. Since my index
is fully static (whenever I need to re-index, I can do it from scratch), I'm going to
try NoMergePolicy with some reasonable maximum segment size.
If there are any other multithreading caveats, I'd welcome them.
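
To illustrate why the skewed sizes hurt, here is a rough back-of-the-envelope sketch
(plain Java, no Lucene calls). It assumes that search cost is proportional to the number
of docs in a segment and that each segment gets its own thread, so the best possible
speedup is total docs divided by the largest segment. The class name, the method name and
the exact segment sizes are hypothetical, chosen only to resemble the layout described
in this thread.

```java
import java.util.Arrays;

public class SegmentSpeedupBound {
    /**
     * Upper bound on the speedup when every segment is searched by its own
     * thread, ASSUMING cost is proportional to the segment's doc count:
     * the slowest thread handles the largest segment, so the bound is
     * total docs / docs in the largest segment.
     */
    static double speedupBound(int[] docsPerSegment) {
        long total = 0;
        int largest = 0;
        for (int docs : docsPerSegment) {
            total += docs;
            largest = Math.max(largest, docs);
        }
        return largest == 0 ? 1.0 : (double) total / largest;
    }

    public static void main(String[] args) {
        // Hypothetical skewed layout: 3 segments of 20 000 docs, 13 of 1 000 docs.
        int[] skewed = new int[16];
        Arrays.fill(skewed, 0, 3, 20_000);
        Arrays.fill(skewed, 3, 16, 1_000);

        // Hypothetical uniform layout: 16 equally sized segments.
        int[] uniform = new int[16];
        Arrays.fill(uniform, 4_500);

        System.out.println("skewed bound:  " + speedupBound(skewed));
        System.out.println("uniform bound: " + speedupBound(uniform));
    }
}
```

Under these assumptions the skewed layout cannot do better than about 3.65x no matter
how many threads are used, while 16 equal segments could in principle approach 16x
(before counting the per-segment initialization and result-merging overhead).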

-- 
Best Regards,
Igor

02.04.2013, 18:07, "Adrien Grand" <jpountz@gmail.com>:
> On Tue, Apr 2, 2013 at 2:29 PM, Igor Shalyminov
> <ishalyminov@yandex-team.ru> wrote:
>
>>  Hello!
>
> Hi Igor,
>
>>  I have a ~20GB index and am trying to run a concurrent search over it.
>>  The index has 16 segments, and I run SpanQuery.getSpans() on each segment concurrently.
>>  I see only a small performance improvement from searching concurrently. I suppose
>>  the reason is that the sizes of the segments are very non-uniform (3 segments have
>>  ~20 000 docs each, and the others have fewer than 1 000 each).
>>  How can I make segments more uniformly sized (I currently just use
>>  writer.forceMerge(16)), and are multiple index segments the most important thing
>>  for Lucene concurrency?
>
> Segments have non-uniform sizes by design. A segment is generated
> every time a flush happens (when the RAM buffer is full or if you
> explicitly call commit). When there are too many segments, Lucene
> merges some of them while new segments keep being generated as you add
> data. So the "flush" segments will always be small while segments
> resulting from a merge will be much larger since they contain data
> from several other segments.
>
> Even if segments are collected concurrently, IndexSearcher needs to
> merge the results collected from each segment at the end. Since
> your segments are very small (20000 docs), maybe the cost of
> initialization/merge is not negligible compared to single-segment
> collection.
>
> --
> Adrien
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org