From: Evert Lammerts
To: general@hadoop.apache.org
Date: Fri, 1 Jul 2011 10:17:52 +0200
Subject: RE: 1Gb vs 10Gb eth (WAS: Re: Hadoop Java Versions)

> What's the justification for a management interface? Doesn't that increase
> complexity? Also, you still need twice the ports?

That's true. It's a tradition that I haven't questioned before. The reasoning, whether right or wrong, is that user jobs (our users are external) can get in the way of admins. If there's a lot of network traffic it takes a lot longer to re-install a node. It also makes security easier - we have a separate SSH daemon listening on the management net that accepts root logins - something the default daemon does not do. Of course the same can be done by running a separate daemon on the public interface, on a port that is only accessible from our internal network.

Evert

On Jun 30, 2011 11:54 PM, "Evert Lammerts" wrote:

>>> Keeping the number of disks per node low and the number of nodes high
>>> should keep the impact of dead nodes in control.
>>
>> It keeps the impact of dead nodes in control, but I don't think that's
>> long-term cost efficient. As prices of 10GbE go down, the "keep the node
>> small" argument seems less fitting. And on another note, most servers
>> manufactured in the last 10 years have dual 1GbE network interfaces. If one
>> were to go by these calcs:
>>
>> > 150 nodes with four 2TB disks each, with HDFS 60% full, it takes around
>> > ~32 minutes to recover
>>
>> it seems like that assumes a single 1GbE interface; why not leverage the
>> second?
>
> I don't know how others set up their clusters.
> We have the tradition that every node in a cluster has at least three
> interfaces - one for interconnects, one for a management network (only
> reachable from within our own network and the primary interface for our
> admins, accessible only through a single management node) and one for ILOM,
> DRAC or whatever lights-out manager is available. This doesn't leave us room
> for bonding interfaces on off-the-shelf nodes. Plus - you'd need twice as
> many ports in your switch.
>
> In the case of Hadoop we're considering adding a fourth NIC for external
> connectivity. We don't want people interacting with HDFS from outside while
> jobs are using the interconnects.
>
> Of course the choice for 1Gb or 10Gb ethernet is a function of price. As
> 10Gb ethernet prices approach those of 1Gb ethernet, it gets more
> attractive. Recovery times scale inversely with ethernet speed, so 1Gb
> ethernet compared to 2Gb bonded ethernet or 10Gb ethernet makes quite a
> difference. I'm just saying that since we have other variables to tweak -
> the number of disks per node and the number of nodes - we can keep recovery
> times and their impact under control by other means.
>
> Another thing to consider is that as 10Gb ethernet gets cheaper, it gets
> more attractive to stop using HDFS (or at least data locality) and start
> using an external storage cluster. Compute node failure then has no impact,
> and disk failure is hardly noticed by compute nodes. But this is still very
> far from as cheap as many small nodes with relatively few disks - I really
> like the bang for buck you get with Hadoop :-)
>
> Evert
>
>
> On Thu, Jun 30, 2011 at 2:31 PM, Evert Lammerts wrote:
>
>> > You can get 12-24 TB in a server today, which means the loss of a server
>> > generates a lot of traffic -which argues for 10GbE.
>> >
>> > But
>> > -big increase in switch cost, especially if you (CoI warning) go with
>> > Cisco
>> > -there have been problems with things like BIOS PXE and lights-out
>> > management on 10GbE -probably due to the NICs being things the BIOS
>> > wasn't expecting and off the mainboard. This should improve.
>> > -I don't know how well Linux works with ether that fast (field reports
>> > useful)
>> > -the big threat is still ToR switch failure, as that will trigger a
>> > re-replication of every block in the rack.
>>
>> Keeping the number of disks per node low and the number of nodes high
>> should keep the impact of dead nodes in control. A ToR switch failing is
>> different - missing 30 nodes (~120TB) at once cannot be fixed by adding
>> more nodes; that actually increases the chance of ToR switch failure.
>> Although such failure is quite rare to begin with, I guess. The
>> back-of-the-envelope calculation I made suggests that ~150 (1U) nodes
>> should be fine with 1Gb ethernet. (E.g., when 6 nodes fail in a cluster of
>> 150 nodes with four 2TB disks each, with HDFS 60% full, it takes ~32
>> minutes to recover; 2 nodes failing should take around 640 seconds. Also
>> see the attached spreadsheet.) This doesn't take ToR switch failure into
>> account though. On the other hand, 150 nodes is only ~5 racks; in such a
>> scenario you might want to shut the system down completely rather than let
>> it replicate 20% of all data.
>>
>> Cheers,
>> Evert
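The ~32 minutes and 640 seconds quoted above are consistent with a simple back-of-the-envelope model: the data lost per failed node is disks per node x disk size x HDFS fill level, and the whole cluster re-replicates it in parallel, with each 1GbE NIC sustaining roughly 80% of line rate. The spreadsheet referenced in the thread is not included here, so the Python sketch below only illustrates that assumed model; the recovery_seconds function, the 80% efficiency figure and the use of all 150 nodes as aggregate re-replication bandwidth are assumptions, not the actual spreadsheet.

# Rough model of the recovery-time numbers above. Assumptions: every node
# in the cluster takes part in re-replication, each 1GbE NIC sustains
# ~80% of line rate, and 1 TB = 1000 GB.

def recovery_seconds(failed_nodes, total_nodes=150, disks_per_node=4,
                     disk_tb=2.0, hdfs_fill=0.6, nic_gbps=1.0,
                     nic_efficiency=0.8):
    # Data to re-replicate, in gigabits.
    lost_gbits = failed_nodes * disks_per_node * disk_tb * hdfs_fill * 1000 * 8
    # Cluster-wide rebuild bandwidth, in Gb/s.
    aggregate_gbps = total_nodes * nic_gbps * nic_efficiency
    return lost_gbits / aggregate_gbps

print(recovery_seconds(6) / 60)               # ~32 minutes for 6 dead nodes
print(recovery_seconds(2))                    # ~640 seconds for 2 dead nodes
print(recovery_seconds(6, nic_gbps=10) / 60)  # ~3 minutes on 10GbE: time scales with 1/speed

Under these assumptions the recovery time is simply (lost data) / (aggregate network bandwidth), which is why it drops linearly as NIC speed, node count, or the fraction of data per node changes.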