lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mike Sokolov <>
Subject Re: Sharding Techniques
Date Tue, 10 May 2011 13:02:40 GMT

> Down to basics, Lucene searches work by locating terms and resolving
> documents from them. For standard term queries, a term is located by a
> process akin to binary search. That means that it uses log(n) seeks to
> get the term. Let's say you have 10M terms in your corpus. If you stored
> that in a single field in a single index with a single segment, it would
> take log(10M) ~= 24 seeks to locate a term. This is of course very
> simplified.
> When you have 63 indexes, log(n) works against you. Even with the
> unrealistic assumption that the 10M terms are evenly distributed and
> without duplicates, the number of seeks for a search that hits all parts
> will still be 63 * log(10M/63) ~= 63 * 18 = 1134. And we haven't even
> begun to estimate the merging part.
This is true, but if the indexes are kept on 63 separate servers, those 
seeks will be carried out in parallel.  The OP did indicate his indexes 
would be on different servers, I think?  I still agree with your overall 
point - at this scale a single server is probably best.  And if there 
are performance issues, I think the usual approach is to create multiple 
mirrored copies (slaves) rather than sharding.  Sharding is useful for 
very large indexes: indexes to big to store on disk and cache in memory 
on one commodity box


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message