Message-ID: <4DC93770.8070905@ifactory.com>
Date: Tue, 10 May 2011 09:02:40 -0400
From: Mike Sokolov <sokolov@ifactory.com>
To: java-user@lucene.apache.org
CC: Toke Eskildsen <te@statsbiblioteket.dk>
Subject: Re: Sharding Techniques
References: <BANLkTimuytxpYnSRXR0qx6-H1VHxSExzKg@mail.gmail.com>
 <1305014508.8672.59.camel@te-prime>
In-Reply-To: <1305014508.8672.59.camel@te-prime>


> Down to basics, Lucene searches work by locating terms and resolving
> documents from them. For standard term queries, a term is located by a
> process akin to binary search. That means that it uses log(n) seeks to
> get the term. Let's say you have 10M terms in your corpus. If you stored
> that in a single field in a single index with a single segment, it would
> take log(10M) ~= 24 seeks to locate a term. This is of course very
> simplified.
>
> When you have 63 indexes, log(n) works against you. Even with the
> unrealistic assumption that the 10M terms are evenly distributed and
> without duplicates, the number of seeks for a search that hits all parts
> will still be 63 * log(10M/63) ~= 63 * 18 = 1134. And we haven't even
> begun to estimate the merging part.
This is true, but if the indexes are kept on 63 separate servers, those 
seeks will be carried out in parallel.  The OP did indicate his indexes 
would be on different servers, I think?  I still agree with your overall 
point - at this scale a single server is probably best.  And if there 
are performance issues, I think the usual approach is to create multiple 
mirrored copies (slaves) rather than sharding.  Sharding is useful for 
very large indexes: indexes too big to store on disk and cache in memory
on one commodity box.
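
To put rough numbers on the parallel case, here is a small sketch that
reuses the simplified model quoted above (one term lookup costs about
ceil(log2(n)) seeks). The function names are made up for illustration,
and the model ignores caching, result merging and network overhead, so
treat the output as a back-of-the-envelope estimate only.

```python
import math


def seeks_per_lookup(num_terms):
    """Estimate seeks for one term lookup as ceil(log2(n)).

    Assumption: the binary-search-like model from the quoted message.
    """
    return math.ceil(math.log2(num_terms))


def sharded_seeks(num_terms, num_shards):
    """Return (total_seeks, seeks_on_critical_path) for a sharded lookup.

    Assumption: terms are spread evenly over the shards with no
    duplicates, and every shard is searched.
    """
    per_shard = seeks_per_lookup(num_terms / num_shards)
    total = per_shard * num_shards  # work summed over all shards
    critical_path = per_shard       # shards on separate servers run in parallel
    return total, critical_path


if __name__ == "__main__":
    print("single index:", seeks_per_lookup(10_000_000))
    total, parallel = sharded_seeks(10_000_000, 63)
    print("63 shards, one server:", total)
    print("63 shards, 63 servers (per-server seeks):", parallel)
```

Under those assumptions, the 63 shards still do more seeks in total
than one index, but on separate servers each one only does its own
share, so the seek count on the critical path is lower than for the
single index. The merge step and network round trips are not counted.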

-Mike
