From: Evert Lammerts
To: Abhishek Mehta, "general@hadoop.apache.org"
Date: Thu, 30 Jun 2011 23:40:15 +0200
Subject: RE: Hadoop Java Versions
Reply-To: general@hadoop.apache.org

That's not a question I'm qualified to answer. I do know we're now buying an Arista for a different cluster, but there are probably loads of others out there.

*forwarded to general@...*

________________________________________
From: Abhishek Mehta [abhishek@tresata.com]
Sent: Thursday, June 30, 2011 11:38 PM
To: Evert Lammerts
Subject: Fwd: Hadoop Java Versions

What are the other switch options (other than Cisco, that is)?

cheers

Abhishek Mehta
(e) abhishek@tresata.com
(v) 980.355.9855

Begin forwarded message:

From: Evert Lammerts
Date: June 30, 2011 5:31:26 PM EDT
To: "general@hadoop.apache.org"
Subject: RE: Hadoop Java Versions
Reply-To: general@hadoop.apache.org

You can get 12-24 TB in a server today, which means the loss of a server generates a lot of traffic - which argues for 10 GbE. But:
- big increase in switch cost, especially if you (CoI warning) go with Cisco
- there have been problems with things like BIOS PXE and lights-out management on 10 GbE - probably because the NICs are off the mainboard and not something the BIOS was expecting. This should improve.
- I don't know how well Linux works with ethernet that fast (field reports useful)
- the big threat is still ToR switch failure, as that will trigger a re-replication of every block in the rack.

Keeping the number of disks per node low and the number of nodes high should keep the impact of dead nodes under control.
A ToR switch failing is different - losing 30 nodes (~120 TB) at once cannot be fixed by adding more nodes; adding nodes actually increases the chance of a ToR switch failure. Although such a failure is quite rare to begin with, I guess.

The back-of-the-envelope calculation I made suggests that ~150 (1U) nodes should be fine with 1 Gb ethernet. (E.g., when 6 nodes fail in a cluster of 150 nodes with four 2 TB disks each, with HDFS 60% full, it takes around ~32 minutes to recover; 2 nodes failing should take around 640 seconds. Also see the attached spreadsheet.) This doesn't take ToR switch failure into account though. On the other hand - 150 nodes is only ~5 racks - in such a scenario you might rather shut the system down completely than let it re-replicate 20% of all data.

Cheers, Evert
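For reference, a minimal sketch of this kind of back-of-the-envelope recovery estimate (not the attached spreadsheet itself; the function name and the flat aggregate-NIC-bandwidth assumption are mine, and disk throughput, ToR uplinks and protocol overhead are ignored, so it lands in the same ballpark as the figures above rather than exactly on them):

#!/usr/bin/env python
# Rough re-replication time estimate after node failures in an HDFS cluster.
# Assumption (mine, not from the attached spreadsheet): recovery is limited
# only by the aggregate NIC bandwidth of the surviving nodes.

def recovery_time_seconds(total_nodes, failed_nodes, disks_per_node,
                          disk_tb, hdfs_fill, nic_gbit=1.0):
    """Estimate seconds needed to re-replicate the blocks of the failed nodes."""
    lost_tb = failed_nodes * disks_per_node * disk_tb * hdfs_fill   # data to re-create
    surviving = total_nodes - failed_nodes
    aggregate_gbyte_per_s = surviving * nic_gbit / 8.0              # Gbit/s -> GB/s
    return lost_tb * 1000.0 / aggregate_gbyte_per_s                 # TB -> GB

if __name__ == "__main__":
    for failed in (2, 6):
        t = recovery_time_seconds(total_nodes=150, failed_nodes=failed,
                                  disks_per_node=4, disk_tb=2.0, hdfs_fill=0.6)
        print("%d failed nodes: ~%.0f s (~%.0f min)" % (failed, t, t / 60))

With these simplified inputs it gives roughly 520 s for 2 failed nodes and roughly 27 minutes for 6, close to the numbers quoted above; losing a whole rack (30 nodes, ~20% of the data) makes the same formula blow up accordingly, which is the point about preferring a controlled shutdown in that case.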