From: Jonathan Hendler
Date: Wed, 07 Nov 2007 21:39:33 +0100
To: hadoop-user@lucene.apache.org
Subject: Re: commodity vs. high perf machines: which would you rather

I like these high-level questions. I have yet to do an actual install, but being an utter newbie, maybe my perspective will contribute something.

I'd say option 3, which is:

1. One failsafe, strong machine for the master.
2. As many "commodity machines" as you can muster, with an emphasis on RAM and disk I/O.

My justification is that the architecture is not P2P and requires the master to do all the scheduling. Of course, I could have misunderstood something, and the master might actually work fine as a weaker machine that only does scheduling and routing, while the workers need to be bulky. I don't know the answer, but I suspect that kind of thinking might point you in the right direction.

A number of other factors also get left out of decisions like this:

1. Long-term planning - how much would you scale in the future? Is this just a demo, or the production environment?
2. Network environment - what kind of network will the machines be living in?
3. What is the MapReduce algorithm you are running? Jobs are normally disk-I/O bound, but maybe you're doing something "difficult" for the CPU.

I see from other posts that ECC is important, and in general more RAM can't hurt. An SSD would be nice too, to help with disk I/O.

HTH
- Jonathan

Chris Fellows wrote:
> Hello,
>
> Much of the hadoop documentation speaks to large clusters of commodity
> machines. There is a debate on our end about which would be better: a
> small number of high-performance machines (2 boxes with 4 quad-core
> processors) or X number of commodity machines.
> I feel that disk I/O might be the bottleneck with the 2 high-perf
> machines (though I did just read in the FAQ about being able to split
> the dfs-data across multiple drives).
>
> So this is a "which would you rather" question. If you were setting up
> a cluster of machines to perform data rollups/aggregation (and other
> mapred tasks) on files in the .25-1TB size range, which would you
> rather have:
>
> 1. 2 machines with 4 quad-core processors each, with your choice of RAM
>    and number of drives
> 2. 10 (or more) commodity machines (as defined on the hadoop wiki)
>
> And of course a "why?" would be very helpful.
>
> Thanks!
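(For anyone curious about the FAQ point above: spreading dfs-data across drives is a single-property change in hadoop-site.xml, a comma-separated list of directories that the DataNode round-robins blocks across. This is a sketch for the 0.1x-era releases; the mount paths below are hypothetical examples, not defaults.)

```xml
<!-- hadoop-site.xml: spread DataNode block storage across several
     physical drives so map/reduce disk I/O is not bottlenecked on one
     spindle. The paths are hypothetical; use one directory per drive. -->
<property>
  <name>dfs.data.dir</name>
  <value>/mnt/disk1/dfs/data,/mnt/disk2/dfs/data,/mnt/disk3/dfs/data</value>
</property>
```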