lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Samarendra Pratap <>
Subject Re: Sharding Techniques
Date Tue, 10 May 2011 14:42:46 GMT
Hi Mike,
*"I think the usual approach is to create multiple mirrored copies (slaves)
rather than sharding"*
This is where my eyes stuck.

 We do have mirrors and in-fact a good number of those. 6 servers are being
used for serving regular queries (2 are for specific queries that do take
time) and each of them receives around 3-3.5 K queries per hour in peak

 The problem is that the interface being used by end users has a lot of
options plus a few text boxes where they can type up to 64 words each. (and
unfortunately i am not able to reduce these things as these are business

 Normal queries go fine under 500 ms but when people start searching
"anything" some queries take up to > 100 seconds. Don't you think
distributing smaller indexes on different machines would reduce the average
search time. (Although I have a feeling that search time for smaller queries
may be slightly increased)

On Tue, May 10, 2011 at 6:32 PM, Mike Sokolov <> wrote:

>  Down to basics, Lucene searches work by locating terms and resolving
>> documents from them. For standard term queries, a term is located by a
>> process akin to binary search. That means that it uses log(n) seeks to
>> get the term. Let's say you have 10M terms in your corpus. If you stored
>> that in a single field in a single index with a single segment, it would
>> take log(10M) ~= 24 seeks to locate a term. This is of course very
>> simplified.
>> When you have 63 indexes, log(n) works against you. Even with the
>> unrealistic assumption that the 10M terms are evenly distributed and
>> without duplicates, the number of seeks for a search that hits all parts
>> will still be 63 * log(10M/63) ~= 63 * 18 = 1134. And we haven't even
>> begun to estimate the merging part.
> This is true, but if the indexes are kept on 63 separate servers, those
> seeks will be carried out in parallel.  The OP did indicate his indexes
> would be on different servers, I think?  I still agree with your overall
> point - at this scale a single server is probably best.  And if there are
> performance issues, I think the usual approach is to create multiple
> mirrored copies (slaves) rather than sharding.  Sharding is useful for very
> large indexes: indexes to big to store on disk and cache in memory on one
> commodity box
> -Mike
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:


  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message