lucene-java-user mailing list archives

From Benson Margulies <ben...@basistech.com>
Subject Re: How to make good use of the multithreaded IndexSearcher?
Date Tue, 01 Oct 2013 20:07:24 GMT
On Tue, Oct 1, 2013 at 3:58 PM, Desidero <desidero@gmail.com> wrote:
> Benson,
>
> Rather than forcing a random number of small segments into the index using
> maxMergedSegmentMB, it might be better to split your index into multiple
> shards. You can create a specific number of balanced shards to control the
> parallelism and then forceMerge each shard down to 1 segment to avoid
> spawning extra threads per shard. Once that's done, you just open all of
> the shards with a MultiReader and use that with the IndexSearcher and an
> ExecutorService.
>
> The downside to this is that it doesn't play nicely with near real-time
> search, but if you have a relatively static index that gets pushed to
> slaves periodically it gets the job done.
>
> As Mike said, it'd be nicer if there was a way to split the docID space
> into virtual shards, but it's not currently available. I'm not sure if
> anyone is even looking into it.

Thanks, folks, for all the help. I'm musing about the top-level issue
here, which is whether the important case is many independent queries
or latency of just one.  In the case where it's just one, we'll follow
the shard-related advice.
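
For the single-query case, the docID-range idea Mike describes in the quoted text can be sketched with plain JDK classes. This is only an illustration under stated assumptions, not Lucene code: `VirtualShardCounter`, `splitDocIdSpace`, and `countMatches` are hypothetical names, and an `IntPredicate` stands in for whatever per-document matching a real searcher would do. It also covers the "count matches yourself" case Adrien mentions, since each range produces its own count and the counts are summed at the end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.function.IntPredicate;

// Hypothetical helper, not part of Lucene: counts matching docs by carving
// the docID space [0, maxDoc) into contiguous ranges, one task per range.
final class VirtualShardCounter {

    // Returns {start, end} pairs (end exclusive) that cover [0, maxDoc).
    // Ranges differ in size by at most one doc; empty ranges are skipped.
    static List<int[]> splitDocIdSpace(int maxDoc, int numShards) {
        if (maxDoc < 0 || numShards <= 0) {
            throw new IllegalArgumentException("maxDoc must be >= 0 and numShards > 0");
        }
        List<int[]> ranges = new ArrayList<>();
        int base = maxDoc / numShards;
        int remainder = maxDoc % numShards;
        int start = 0;
        for (int i = 0; i < numShards; i++) {
            int size = base + (i < remainder ? 1 : 0);
            if (size == 0) {
                continue; // fewer docs than shards
            }
            ranges.add(new int[] {start, start + size});
            start += size;
        }
        return ranges;
    }

    // Submits one counting task per range and sums the per-range hit counts.
    // "matches" is an assumed stand-in for real per-document query evaluation.
    static long countMatches(int maxDoc, int numShards, IntPredicate matches,
                             ExecutorService pool) {
        List<Future<Long>> futures = new ArrayList<>();
        for (int[] range : splitDocIdSpace(maxDoc, numShards)) {
            final int from = range[0];
            final int to = range[1];
            Callable<Long> task = () -> {
                long hits = 0;
                for (int doc = from; doc < to; doc++) {
                    if (matches.test(doc)) {
                        hits++;
                    }
                }
                return hits;
            };
            futures.add(pool.submit(task));
        }
        long total = 0;
        for (Future<Long> future : futures) {
            try {
                total += future.get();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while counting", e);
            } catch (ExecutionException e) {
                throw new IllegalStateException("range task failed", e.getCause());
            }
        }
        return total;
    }
}
```

With a fixed pool from `Executors.newFixedThreadPool(n)`, a call such as `VirtualShardCounter.countMatches(maxDoc, n, predicate, pool)` keeps the parallelism equal to the number of ranges chosen, independent of how many segments the index happens to have.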




>
> Regards,
> Matt
>
>
> On Tue, Oct 1, 2013 at 7:09 AM, Michael McCandless <
> lucene@mikemccandless.com> wrote:
>
>> You might want to set a smallish maxMergedSegmentMB in
>> TieredMergePolicy to "force" enough segments in the index ... sort of
>> the opposite of optimizing.
>>
>> Really, IndexSearcher's approach to using one thread per segment is
>> rather silly, and, it's annoying/bad to expose change in behavior due
>> to segment structure.
>>
>> I think it'd be better to carve up the overall docID space into N
>> virtual shards.  Ie, if you have 100M docs, then one thread searches
>> docs 0-10M, another 10M-20M, etc.  Nobody has created such a searcher
>> impl but it should not be hard and it would be agnostic to the segment
>> structure.
>>
>> But then again, this need (using concurrent hardware to reduce latency
>> of a single query) is somewhat rare; most apps are fine using the
>> concurrency across queries rather than within one query.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Tue, Oct 1, 2013 at 7:09 AM, Adrien Grand <jpountz@gmail.com> wrote:
>> > Hi Benson,
>> >
>> > On Mon, Sep 30, 2013 at 5:21 PM, Benson Margulies <benson@basistech.com> wrote:
>> >> The multithreaded index searcher fans out across segments. How aggressively
>> >> does 'optimize' reduce the number of segments? If the segment count goes
>> >> way down, is there some other way to exploit multiple cores?
>> >
>> > forceMerge[1], formerly known as optimize, takes a parameter to
>> > configure how many segments should remain in the index.
>> >
>> > Regarding multi-core usage, if your query load is high enough to use
>> > all your CPUs (there are always #cores queries running in parallel),
>> > there is generally no need to use the multi-threaded IndexSearcher.
>> > The multi-threaded index searcher can however help in case all CPU
>> > power is not in use or if you care more about latency than throughput.
>> > It indeed leverages the fact that the index is split into segments
>> > to parallelize query execution, so a fully merged index will actually
>> > run the query in a single thread in any case.
>> >
>> > There is no way to make query execution efficiently use several cores
>> > on a single-segment index so if you really want to parallelize query
>> > execution, you will have to shard the index to do at the index level
>> > what the multi-threaded IndexSearcher does at the segment level.
>> >
>> > Side notes:
>> >  - A single-segment index runs only terms-dictionary-intensive queries
>> > more efficiently; running forceMerge on an index is generally
>> > discouraged unless the index is read-only.
>> >  - The multi-threaded index searcher only parallelizes query execution
>> > in certain cases. In particular, it never parallelizes execution when
>> > the method takes a collector. This means that if you want to use
>> > TotalHitCountCollector to count matches, you will have to do the
>> > parallelization by yourself.
>> >
>> > [1] http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/index/IndexWriter.html#forceMerge%28int%29
>> >
>> > --
>> > Adrien
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: java-user-help@lucene.apache.org
>> >
>>


