From: Evert Lammerts
To: general@hadoop.apache.org
Date: Fri, 1 Jul 2011 10:17:52 +0200
Subject: RE: 1Gb vs 10Gb eth (WAS: Re: Hadoop Java Versions)

> What's the justification for a management interface? Doesn't that increase
> complexity? Also, you still need twice the ports?

That's true. It's a tradition that I haven't questioned before. The reasoning, whether right or wrong, is that user jobs (our users are external) can get in the way of admins. If there's a lot of network traffic it takes a lot longer to re-install a node. It also makes security easier - we have a separate SSH daemon listening on the management net that accepts root logins - something the default daemon does not do. Of course the same can be done by running a separate daemon on the public interface, on a port that is only accessible from our internal network.

Evert

On Jun 30, 2011 11:54 PM, "Evert Lammerts" wrote:

>>> Keeping the number of disks per node low and the number of nodes high
>>> should keep the impact of dead nodes in control.
>>
>> It keeps the impact of dead nodes in control, but I don't think that's
>> long-term cost efficient. As prices of 10GbE go down, the "keep the node
>> small" argument seems less fitting. And on another note, most servers
>> manufactured in the last 10 years have dual 1GbE network interfaces. If one
>> were to go by these calcs:
>>
>> > 150 nodes with four 2TB disks each, with HDFS 60% full, it takes around
>> > ~32 minutes to recover
>>
>> it seems like that assumes a single 1GbE interface; why not leverage the
>> second?
>
> I don't know how others set up their clusters.
> We have the tradition that every node in a cluster has at least three
> interfaces - one for interconnects, one for a management network (only
> reachable from within our own network and the primary interface for our
> admins, accessible only through a single management node) and one for ILOM,
> DRAC or whatever lights-out manager is available. This doesn't leave us room
> for bonding interfaces on off-the-shelf nodes. Plus - you'd need twice as
> many ports in your switch.
>
> In the case of Hadoop we're considering adding a fourth NIC for external
> connectivity. We don't want people interacting with HDFS from outside while
> jobs are using the interconnects.
>
> Of course the choice for 1Gb or 10Gb ethernet is a function of price. As
> 10Gb ethernet prices approach those of 1Gb ethernet, it gets more
> attractive. Recovery times scale inversely with ethernet speed, so 1Gb
> ethernet compared to 2Gb bonded ethernet or 10Gb ethernet makes quite a
> difference. I'm just saying that since we have other variables to tweak -
> the number of disks per node and the number of nodes - we can keep recovery
> times and their impact under control by other means.
>
> Another thing to consider is that as 10Gb ethernet gets cheaper, it gets
> more attractive to stop using HDFS (or at least data locality) and start
> using an external storage cluster. Compute node failure then has no impact,
> and disk failure is hardly noticed by compute nodes. But this is still very
> far from as cheap as many small nodes with relatively few disks - I really
> like the bang for buck you get with Hadoop :-)
>
> Evert
>
>
> On Thu, Jun 30, 2011 at 2:31 PM, Evert Lammerts wrote:
>
>> > You can get 12-24 TB in a server today, which means the loss of a server
>> > generates a lot of traffic -which argues for 10GbE.
>> >
>> > But
>> > -big increase in switch cost, especially if you (CoI warning) go with
>> > Cisco
>> > -there have been problems with things like BIOS PXE and lights-out
>> > management on 10GbE -probably due to the NICs being things the BIOS
>> > wasn't expecting and off the mainboard. This should improve.
>> > -I don't know how well Linux works with ether that fast (field reports
>> > useful)
>> > -the big threat is still ToR switch failure, as that will trigger a
>> > re-replication of every block in the rack.
>>
>> Keeping the number of disks per node low and the number of nodes high
>> should keep the impact of dead nodes in control. A ToR switch failing is
>> different - missing 30 nodes (~120TB) at once cannot be fixed by adding
>> more nodes; that actually increases the chance of ToR switch failure.
>> Although such failure is quite rare to begin with, I guess. The
>> back-of-the-envelope calculation I made suggests that ~150 (1U) nodes
>> should be fine with 1Gb ethernet. (E.g., when 6 nodes fail in a cluster of
>> 150 nodes with four 2TB disks each, with HDFS 60% full, it takes ~32
>> minutes to recover; 2 nodes failing should take around 640 seconds. Also
>> see the attached spreadsheet.) This doesn't take ToR switch failure into
>> account though. On the other hand, 150 nodes is only ~5 racks; in such a
>> scenario you might want to shut the system down completely rather than let
>> it replicate 20% of all data.
>>
>> Cheers,
>> Evert
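The ~32 minutes and 640 seconds quoted above are consistent with a simple back-of-the-envelope model: the data lost per failed node is disks per node x disk size x HDFS fill level, and the whole cluster re-replicates it in parallel, with each 1GbE NIC sustaining roughly 80% of line rate. The spreadsheet referenced in the thread is not included here, so the Python sketch below only illustrates that assumed model; the recovery_seconds function, the 80% efficiency figure and the use of all 150 nodes as aggregate re-replication bandwidth are assumptions, not the actual spreadsheet.

# Rough model of the recovery-time numbers above. Assumptions: every node
# in the cluster takes part in re-replication, each 1GbE NIC sustains
# ~80% of line rate, and 1 TB = 1000 GB.

def recovery_seconds(failed_nodes, total_nodes=150, disks_per_node=4,
                     disk_tb=2.0, hdfs_fill=0.6, nic_gbps=1.0,
                     nic_efficiency=0.8):
    # Data to re-replicate, in gigabits.
    lost_gbits = failed_nodes * disks_per_node * disk_tb * hdfs_fill * 1000 * 8
    # Cluster-wide rebuild bandwidth, in Gb/s.
    aggregate_gbps = total_nodes * nic_gbps * nic_efficiency
    return lost_gbits / aggregate_gbps

print(recovery_seconds(6) / 60)               # ~32 minutes for 6 dead nodes
print(recovery_seconds(2))                    # ~640 seconds for 2 dead nodes
print(recovery_seconds(6, nic_gbps=10) / 60)  # ~3 minutes on 10GbE: time scales with 1/speed

Under these assumptions the recovery time is simply (lost data) / (aggregate network bandwidth), which is why it drops linearly as NIC speed, node count, or the fraction of data per node changes.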