From: Manuel Le Normand
Date: Thu, 11 Apr 2013 22:31:37 +0300
Subject: Re: Slow qTime for distributed search
To: "solr-user@lucene.apache.org"

Hi,

We have different working hours, sorry for the delayed reply.

Your assumed numbers are right: about 25-30 KB per doc, giving a total of
15 GB per shard, and there are two shards per server (plus two slaves that
should normally do no work). An average query has about 30 conditions (OR
and AND mixed), most of them textual and a small part on dateTime fields.
The queries are simple (no facets, filters, etc.), since they are taken
from the actual query set of my enterprise, which currently runs on an old
search engine.

As we said, since the shards in collection1 and collection2 have the same
number of docs each (and the same RAM and CPU per shard), it is apparently
not a slow-IO issue, right? So the fact that my index is not entirely
cached does not seem to be the bottleneck. Moreover, I do store the
fields, but my query set requests only the IDs and rarely snippets, so I'd
assume that any extra RAM I gave the OS would make no difference, as the
*.fdt files don't need to be cached.

The conclusion I come to is that merging is the problem, and the only way
to outsmart it is to distribute across far fewer shards, which takes me
back to a few million docs per shard; query time grows roughly linearly
with the number of docs per shard, though that should improve if I give
each server much more RAM.

I'll try tweaking my schema and making better use of the Solr caches
(filter queries, for example; a sketch follows below), but something tells
me the problem might be elsewhere. My main clue is that merging looks like
a simple CPU task, yet tests show it takes a long time even when the
result set is small (and merging a few docs should clearly be very quick).
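As an illustration of the filter-query idea (the field names here are
invented, and only two of the ~30 clauses are shown): moving a repeated
dateTime range out of q and into fq lets Solr cache it in the filterCache
and skip scoring it.

  Everything in one boolean query; nothing is reused across queries:

    q=((title:network AND body:outage) OR (title:router AND body:failure)) AND timestamp:[2013-01-01T00:00:00Z TO 2013-04-01T00:00:00Z]
    fl=id
    rows=10

  With the range as a filter query, cached and reused:

    q=(title:network AND body:outage) OR (title:router AND body:failure)
    fq=timestamp:[2013-01-01T00:00:00Z TO 2013-04-01T00:00:00Z]
    fl=id
    rows=10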
On Wed, Apr 10, 2013 at 2:50 AM, Shawn Heisey wrote:

> On 4/9/2013 3:50 PM, Furkan KAMACI wrote:
>> Hi Shawn;
>>
>> You say that:
>>
>> *... your documents are about 50KB each. That would translate to an
>> index that's at least 25GB*
>>
>> I know we cannot give an exact size, but what is the approximate ratio
>> of document size to index size in your experience?
>
> If you store the fields, that is actual size plus a small amount of
> overhead. Starting with Solr 4.1, stored fields are compressed. I
> believe that it uses LZ4 compression. Some people store all fields,
> some people store only a few or one - an ID field. The size of stored
> fields does have an impact on how much OS disk cache you need, but not
> as much as the other parts of an index.
>
> It's been my experience that termvectors take up almost as much space
> as stored data for the same fields, and sometimes more. Starting with
> Solr 4.2, termvectors are also compressed.
>
> Adding docValues (new in 4.2) to the schema will also make the index
> larger. The requirements here are similar to stored fields. I do not
> know whether this data gets compressed, but I don't think it does.
>
> As for the indexed data, this is where I am less clear about the
> storage ratios, but I think you can count on it needing almost as much
> space as the original data. If the schema uses types or filters that
> produce a lot of information, the indexed data might be larger than
> the original input. Examples of data explosions in a schema: trie
> fields with a non-zero precisionStep, the edgengram filter, the
> shingle filter.
>
> Thanks,
> Shawn
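To make the quoted schema points concrete, here is a minimal, hypothetical
schema.xml fragment for Solr 4.2 (type and field names are invented for
illustration):

    <!-- precisionStep > 0 makes a trie field index several terms per
         value to speed up range queries: one of the "data explosion"
         examples above -->
    <fieldType name="tdate" class="solr.TrieDateField"
               precisionStep="6" positionIncrementGap="0"/>

    <!-- precisionStep = 0 indexes one term per value: a smaller index,
         but slower range queries -->
    <fieldType name="date" class="solr.TrieDateField"
               precisionStep="0" positionIncrementGap="0"/>

    <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>

    <!-- stored="true" writes LZ4-compressed data (since 4.1) to the
         .fdt files; docValues="true" (4.2+) adds a separate structure
         whose size is roughly comparable to stored fields -->
    <field name="id" type="string" indexed="true" stored="true"
           required="true"/>
    <field name="timestamp" type="tdate" indexed="true" stored="false"
           docValues="true" default="NOW"/>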