Subject: Re: Solr hardware memory question
From: Toke Eskildsen
To: solr-user@lucene.apache.org
Date: Thu, 12 Dec 2013 09:31:33 +0100
Organization: State and University Library, Denmark

On
Thu, 2013-12-12 at 02:46 +0100, Joel Bernstein wrote:
> Curious how many documents per shard you were planning?

350-500 million, optimized to a single segment as the data are not
changing.

> The number of documents per shard and field type will drive the amount
> of a RAM needed to sort and facet.

Very true. It makes a lot of sense to separate the RAM requirements for
the Lucene/Solr structures from those for OS caching.

It seems that Gil is working on about the same project as we are, so I
will elaborate in this thread:

We would like to perform some sort of grouping on URL, so that the same
page harvested at different points in time is only displayed once. This
is probably the heaviest functionality, as the cardinality of the field
will be near the number of documents.

For plain(er) faceting, things like MIME-type, harvest date and site
seem relevant. Those fields have lower cardinality and they are
single-valued, so the memory requirements are something like
  #docs*log2(#unique_values) bits
With 500M documents and 1000 values, that is about 10 bits per document,
or roughly 600MB per shard. With 20 shards, we are looking at about 12GB
per simple facet field.

Regards,
Toke Eskildsen
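
The estimate above for single-valued facet fields can be reproduced with a short Python sketch. It assumes one bit-packed ordinal per document, with the bit width rounded up to a whole number, and decimal units (1MB = 10^6 bytes). The function name facet_field_bytes is made up for illustration and is not part of Solr or Lucene.

```python
def facet_field_bytes(num_docs: int, unique_values: int) -> int:
    """Rough size in bytes of one single-valued facet field on one shard.

    Assumption of this sketch: each document stores one ordinal, packed
    into ceil(log2(unique_values)) bits (at least 1 bit).
    """
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without
    # floating-point rounding issues.
    bits_per_doc = max(1, (unique_values - 1).bit_length())
    return num_docs * bits_per_doc // 8


docs_per_shard = 500_000_000
unique_values = 1000
shards = 20

per_shard = facet_field_bytes(docs_per_shard, unique_values)
total = per_shard * shards

print(f"per shard: {per_shard / 1e6:.0f} MB")  # per shard: 625 MB
print(f"{shards} shards: {total / 1e9:.1f} GB")  # 20 shards: 12.5 GB
```

The 600MB and 12GB figures in the mail are these values rounded down. A real index adds overhead for the term dictionary and other structures, which this sketch does not model.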