Message-ID: <4A4864E7.3070609@boboco.ie>
Date: Mon, 29 Jun 2009 07:53:27 +0100
From: Eric Bowman
To: java-user@lucene.apache.org
Subject:
Re: Scaling out/up or a mix
References: <7e536b1f0906261500m297efb0cv107e2b2c5cd94ac3@mail.gmail.com> <7e536b1f0906281413m276606ccyca58036de05708b6@mail.gmail.com>
In-Reply-To: <7e536b1f0906281413m276606ccyca58036de05708b6@mail.gmail.com>

There is no single answer -- this is always application-specific.
Without knowing anything about what you are doing:

1. Disk I/O is probably the most critical. Go SSD or even RAM disk if
you can, if performance is absolutely critical.
2. Sometimes CPU can become an issue, but 8 cores is probably enough
unless you are doing especially CPU-bound searches.

Unless you are doing something with hard performance requirements, or
really quite unusual, buying "good" kit is probably good enough, and you
won't really know for sure until you measure. Lucene is a general
enough tool that there isn't a terribly universal answer to this. We
were a bit surprised to end up CPU-bound instead of disk I/O-bound, for
instance, but we ended up taking an unusual path. YMMV.

Marcus Herou wrote:
> Hi. I think I need to be more specific.
>
> What I am trying to find out is if I should aim for:
>
> CPU (2x4 cores, 2.0-3.0 GHz)? Or perhaps just 4 cores is enough.
> Fast disk I/O: 8 disks, RAID1+0? Or perhaps 2 disks is enough...
> RAM - if the index does not fit into RAM, how much RAM should I then buy?
>
> Please, any hints would be appreciated since I am going to invest soon.
>
> //Marcus
>
> On Sat, Jun 27, 2009 at 12:00 AM, Marcus Herou wrote:
>
>> Hi.
>>
>> I currently have an index which is 16GB per machine (8 machines = 128GB)
>> (data is stored externally, not in index) and is growing like crazy (we are
>> indexing blogs which is crazy by nature) and have only allocated 2GB per
>> machine to the Lucene app since we are running some other stuff there in
>> parallel.
>>
>> Each doc should be roughly the size of a blog post, no more than 20k.
>>
>> We currently have about 90M documents and it is increasing rapidly so
>> getting into the G+ document range is not going to be too far away.
>>
>> Now due to search performance I think I need to move these instances to
>> dedicated index/search machines (or index on some machines and search on
>> others). Anyway I would like to get some feedback about two things:
>>
>> 1. What is the most important hardware aspect when it comes to adding
>> documents to the index and optimizing it?
>> 1.1 Is it disk I/O write throughput? (sequential or random I/O?)
>> 1.2 Is it RAM?
>> 1.3 Is it CPU?
>>
>> My guess would be disk I/O, right, wrong?
>>
>> 2. What is the most important hardware aspect when it comes to searching
>> documents in my setup? (result-set is limited to return only the top 10
>> matches with page handling)
>> 2.1 Is it disk read throughput? (sequential or random I/O?)
>> 2.2 Is it RAM?
>> 2.3 Is it CPU?
>>
>> I have no clue since the data might not fit into memory. What is then the
>> most important factor? Read performance while scanning the index? CPU
>> while comparing fields and collecting results?
>>
>> What I'm trying to find out is what I can do to get most bang for the buck
>> with a limited (aren't we all limited?) budget.
>>
>> Kindly
>>
>> //Marcus
>>
>> --
>> Marcus Herou CTO and co-founder Tailsweep AB
>> +46702561312
>> marcus.herou@tailsweep.com
>> http://www.tailsweep.com/
>>
>

--
Eric Bowman
Boboco Ltd
ebowman@boboco.ie
http://www.boboco.ie/ebowman/pubkey.pgp
+35318394189/+353872801532
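
The "you won't really know for sure until you measure" advice in the reply can be sketched roughly as follows. This is only a minimal timing harness under stated assumptions, not anything from the thread: `LatencyProbe`, `measure`, `percentile` and `fakeSearch` are hypothetical names, none of them is a Lucene API, and the workload is a CPU-only stand-in that would need to be replaced with the real search call against a realistically sized index and query log.

```java
import java.util.Arrays;
import java.util.function.IntConsumer;

/** Hypothetical helper: times a workload repeatedly and reports latency percentiles. */
final class LatencyProbe {

    /** Nearest-rank percentile (0 < p <= 100) over samples sorted in ascending order. */
    static long percentile(long[] sortedNanos, double p) {
        if (sortedNanos.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        int rank = (int) Math.ceil(p / 100.0 * sortedNanos.length);
        int index = Math.max(0, Math.min(sortedNanos.length - 1, rank - 1));
        return sortedNanos[index];
    }

    /**
     * Calls the workload `warmup` times without timing it, then `runs` times
     * with timing, and returns the timed durations in nanoseconds, sorted.
     */
    static long[] measure(IntConsumer workload, int warmup, int runs) {
        for (int i = 0; i < warmup; i++) {
            workload.accept(i);
        }
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            workload.accept(i);
            samples[i] = System.nanoTime() - start;
        }
        Arrays.sort(samples);
        return samples;
    }

    public static void main(String[] args) {
        // Assumption: stand-in workload. Replace the body with the real query.
        IntConsumer fakeSearch = i -> {
            long sum = 0;
            for (int k = 0; k < 100_000; k++) {
                sum += (k ^ i);
            }
            // Use the result so the JIT is less likely to discard the loop.
            if (sum == 42) {
                System.out.print("");
            }
        };

        long[] samples = measure(fakeSearch, 1_000, 10_000);
        System.out.printf("p50=%d us, p95=%d us, p99=%d us%n",
                percentile(samples, 50) / 1_000,
                percentile(samples, 95) / 1_000,
                percentile(samples, 99) / 1_000);
    }
}
```

Running the same probe while watching disk and CPU utilisation on the box (for instance with the operating system's own monitoring tools) is one way to see whether a given setup ends up I/O-bound or CPU-bound, which is the distinction the reply says is hard to predict in advance.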