hadoop-general mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Todd Lipcon <t...@cloudera.com>
Subject Re: Hadoop Java Versions
Date Fri, 01 Jul 2011 00:24:00 GMT
On Thu, Jun 30, 2011 at 5:16 PM, Ted Dunning <tdunning@maprtech.com> wrote:

> You have to consider the long-term reliability as well.
>
> Losing an entire set of 10 or 12 disks at once makes the overall
> reliability
> of a large cluster very suspect.  This is because it becomes entirely too
> likely that two additional drives will fail before the data on the off-line
> node can be replicated.  For 100 nodes, that can decrease the average time
> to data loss down to less than a year.  This can only be mitigated in stock
> hadoop by keeping the number of drives relatively low.  MapR avoids this by
> not failing nodes for trivial problems.
>

I'd advise you to look at "stock hadoop" again. This used to be true, but
was fixed a long while back by HDFS-457 and several followup JIRAs.

If MapR does something fancier, I'm sure we'd be interested to hear about it
so we can compare the approaches.

-Todd


>
> On Thu, Jun 30, 2011 at 4:18 PM, Aaron Eng <aeng@maprtech.com> wrote:
>
> > >Keeping the amount of disks per node low and the amount of nodes high
> > should keep the impact of dead nodes in control.
> >
> > It keeps the impact of dead nodes in control but I don't think thats
> > long-term cost efficient.  As prices of 10GbE go down, the "keep the node
> > small" arguement seems less fitting.  And on another note, most servers
> > manufactured in the last 10 years have dual 1GbE network interfaces.  If
> > one
> > were to go by these calcs:
> >
> > >150 nodes with four 2TB disks each, with HDFS 60% full, it takes around
> > ~32
> > minutes to recover
> >
> > It seems like that assumes a single 1GbE interface, why  not leverage the
> > second?
> >
> > On Thu, Jun 30, 2011 at 2:31 PM, Evert Lammerts <Evert.Lammerts@sara.nl
> > >wrote:
> >
> > > > You can get 12-24 TB in a server today, which means the loss of a
> > server
> > > > generates a lot of traffic -which argues for 10 Gbe.
> > > >
> > > > But
> > > >   -big increase in switch cost, especially if you (CoI warning) go
> with
> > > > Cisco
> > > >   -there have been problems with things like BIOS PXE and lights out
> > > > management on 10 Gbe -probably due to the NICs being things the BIOS
> > > > wasn't expecting and off the mainboard. This should improve.
> > > >   -I don't know how well linux works with ether that fast (field
> > reports
> > > > useful)
> > > >   -the big threat is still ToR switch failure, as that will trigger a
> > > > re-replication of every block in the rack.
> > >
> > > Keeping the amount of disks per node low and the amount of nodes high
> > > should keep the impact of dead nodes in control. A ToR switch failing
> is
> > > different - missing 30 nodes (~120TB) at once cannot be fixed by adding
> > more
> > > nodes; that actually increases ToR switch failure. Although such
> failure
> > is
> > > quite rare to begin with, I guess. The back-of-the-envelope-calculation
> I
> > > made suggests that ~150 (1U) nodes should be fine with 1Gb ethernet.
> > (e.g.,
> > > when 6 nodes fail in a cluster with 150 nodes with four 2TB disks each,
> > with
> > > HDFS 60% full, it takes around ~32 minutes to recover. 2 nodes failing
> > > should take around 640 seconds. Also see the attached spreadsheet.)
> This
> > > doesn't take ToR switch failure in account though. On the other hand -
> > 150
> > > nodes is only ~5 racks - in such a scenario you might rather want to
> shut
> > > the system down completely rather than letting it replicate 20% of all
> > data.
> > >
> > > Cheers,
> > > Evert
> >
>



-- 
Todd Lipcon
Software Engineer, Cloudera

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message