lucene-java-user mailing list archives

From Samarendra Pratap <samarz...@gmail.com>
Subject Re: Sharding Techniques
Date Tue, 10 May 2011 09:22:09 GMT
Thanks
 to Johannes - I am looking into Katta. It seems promising.
 to Toke - Great explanation. That's what I was looking for.

 I'll come back and share my experience.
Thank you very much.


On Tue, May 10, 2011 at 1:31 PM, Toke Eskildsen <te@statsbiblioteket.dk> wrote:

> On Mon, 2011-05-09 at 13:56 +0200, Samarendra Pratap wrote:
> > We have an index directory of 30 GB which is divided into 3 subdirectories
> > (idx1, idx2, idx3) which are again divided into 21 sub-subdirectories
> > (idx1-1, idx1-2, ...., idx2-1, ...., idx3-1, ...., idx3-21).
>
> So each part is about ½ GB in size? That gives you a serious logistical
> overhead. You state later that you only update the index once a day, so
> it would seem that you have no need for the fast update times that such
> small indexes give you. My guess is that you will get faster search
> times by using a single index.
>
>
> Down to basics, Lucene searches work by locating terms and resolving
> documents from them. For standard term queries, a term is located by a
> process akin to binary search. That means that it uses about log2(n) seeks
> to get the term. Let's say you have 10M terms in your corpus. If you stored
> that in a single field in a single index with a single segment, it would
> take log2(10M) ~= 24 seeks to locate a term. This is of course very
> simplified.
>
> When you have 63 indexes, log2(n) works against you. Even with the
> unrealistic assumption that the 10M terms are evenly distributed and
> without duplicates, the number of seeks for a search that hits all parts
> will still be 63 * log2(10M/63) ~= 63 * 18 = 1134. And we haven't even
> begun to estimate the merging part.
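
The seek arithmetic quoted above can be checked with a small Java sketch. It makes the same simplifying assumptions as the quoted text (an idealised binary search, terms split evenly across the indexes, no duplicates) and rounds the base-2 logarithm up. The class and method names are made up for illustration, and nothing here calls Lucene or models its real term dictionary.

```java
// Back-of-the-envelope seek estimate; not a model of Lucene's term dictionary.
final class SeekEstimate {

    /** Worst-case probes of a binary search over n sorted terms: ceil(log2(n)). */
    static int seeksPerIndex(long terms) {
        if (terms < 1) {
            throw new IllegalArgumentException("terms must be >= 1");
        }
        // The bit length of (terms - 1) equals ceil(log2(terms)) for terms >= 1.
        return 64 - Long.numberOfLeadingZeros(terms - 1);
    }

    /** Seeks for one term lookup that has to visit every index. */
    static long totalSeeks(long totalTerms, int indexes) {
        if (indexes < 1) {
            throw new IllegalArgumentException("indexes must be >= 1");
        }
        // Assumption: terms are split evenly (rounded up), with no duplicates.
        long termsPerIndex = (totalTerms + indexes - 1) / indexes;
        return (long) indexes * seeksPerIndex(termsPerIndex);
    }

    public static void main(String[] args) {
        long terms = 10_000_000L;
        System.out.println(totalSeeks(terms, 1));   // prints 24
        System.out.println(totalSeeks(terms, 63));  // prints 1134
    }
}
```

Under these assumptions the split index needs roughly 47 times as many seeks per term lookup as the single index, before any result merging is counted.
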
>
> Due to caching, a seek is not equal to the storage being hit, but the
> probability of a storage hit rises with the number of seeks and the
> inevitable term duplicates when splitting the index.
>
> > We have almost 40 fields in each index (is it bad to have so many
> > fields?). Most of them are id based fields.
>
> Nah, our index is about 40GB with 100+ fields and 8M documents. We use a
> single index, optimized to 5 segments. Response times for raw searches
> are a few ms, while response times for the full package (heavy faceting)
> are generally below 300 ms. Our queries are mostly simple boolean queries
> across 13 fields.
>
> > Keeping parts of indexes on different servers, searching on all of them and
> > then merging the results - what could be the best approach?
>
> Locate your bottleneck. Some well-placed log statements or a quick peek
> with VisualVM (comes with the Oracle JDK) should help a lot.
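
One way to place such log statements is to wrap each stage of a request in a timer and print the totals. The following is a minimal sketch, assuming the request can be split into stages such as searching the parts and merging the results. StageTimer and the placeholder stages in main are hypothetical and stand in for the real Lucene calls.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical helper: accumulates wall-clock time per named stage.
final class StageTimer {
    private final Map<String, Long> nanosByStage = new LinkedHashMap<>();

    /** Runs the work, adds its elapsed time to the stage total, returns its result. */
    <T> T time(String stage, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            // Recorded even if the work throws, so failed stages still show up.
            nanosByStage.merge(stage, System.nanoTime() - start, Long::sum);
        }
    }

    /** Accumulated nanoseconds per stage, in first-seen order. */
    Map<String, Long> timings() {
        return nanosByStage;
    }

    public static void main(String[] args) {
        StageTimer timer = new StageTimer();
        // Placeholder stages; a real application would call its search code here.
        int[] hits = timer.time("search", () -> new int[] {3, 1, 2});
        int[] merged = timer.time("merge", () -> {
            int[] copy = hits.clone();
            Arrays.sort(copy);
            return copy;
        });
        System.out.println("merged hits: " + merged.length);
        timer.timings().forEach((stage, nanos) ->
                System.out.printf("%-8s %d us%n", stage, nanos / 1_000));
    }
}
```

Timings of a single request are noisy, so totals over many requests give a more reliable picture of where the time goes.
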


-- 
Regards,
Samar
