hadoop-hdfs-user mailing list archives

From Koji Noguchi <knogu...@yahoo-inc.com>
Subject Re: Sizing help
Date Fri, 11 Nov 2011 17:26:16 GMT
Another factor to consider: when a disk goes bad, you may have corrupted blocks which may only get
detected by the periodic DataBlockScanner check.
I believe each datanode tries to finish the entire scan within the dfs.datanode.scan.period.hours
period (3 weeks by default).
So with 2x replication and some undetected bad disk(s), you can have blocks with an effective
replication of 1, which would eventually lead to missing blocks.
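A rough back-of-envelope on that detection delay, assuming the scanner spreads its work uniformly over the configured period (so a given block is checked, on average, halfway through it):

```python
# Back-of-envelope: how long a corrupted block sits undetected on average,
# assuming the DataBlockScanner spreads its work uniformly across the
# dfs.datanode.scan.period.hours window.
scan_period_hours = 21 * 24                  # 3-week default = 504 hours

# On average a given block gets scanned halfway through the period.
avg_detection_hours = scan_period_hours / 2
avg_detection_days = avg_detection_hours / 24

print(f"average detection latency: {avg_detection_days:.1f} days")
```

During that window, a 2x-replicated block whose other copy sits on the bad disk is effectively at replication 1.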


On 11/11/11 1:57 AM, "Matt Foley" <mfoley@hortonworks.com> wrote:

I agree with Ted's argument that 3x replication is way better than 2x.  But I do have to point
out that, since 0.20.204, the loss of a disk no longer causes the loss of a whole node (thankfully!)
unless it's the system disk.  So in the example given, if you estimate a disk failure every
2 hours, each node only has to re-replicate about 2GB of data, not 12GB.  So about 1-in-72
such failures risks data loss, rather than 1-in-12.  Which is still unacceptable, so use 3x
replication! :-)
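Matt's 1-in-72 figure can be reproduced with the same model Ted uses below (exponentially distributed inter-failure times, one disk failure roughly every 2 hours cluster-wide); the 10-minute re-replication window and the factor-of-6 reduction (2GB vs 12GB per node) come straight from the thread:

```python
import math

mean_failure_interval_min = 120        # one disk failure ~every 2 hours cluster-wide
full_node_window_min = 10              # Ted's window to re-replicate 12GB/node
single_disk_window_min = full_node_window_min / 6   # only ~2GB/node for one lost disk

# Probability another disk fails before re-replication completes.
p = 1 - math.exp(-single_disk_window_min / mean_failure_interval_min)
print(f"p = {p:.4f}  (~1 in {int(1/p)})")
```

The factor-of-6 drop (12GB down to 2GB) turns Ted's 1-in-12 into roughly 1-in-72, since the probabilities are small enough to scale almost linearly.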

On Mon, Nov 7, 2011 at 4:53 PM, Ted Dunning <tdunning@maprtech.com> wrote:
3x replication has two effects.  One is reliability.  This is probably more important in large
clusters than small.

Another important effect is data locality during map-reduce.  Having 3x replication allows
mappers to have almost all invocations read from local disk; 2x replication compromises this.
Even where you don't have local data, the bandwidth available to read from 3x-replicated
data is 1.5x the bandwidth available for 2x replication.

To get a rough feel for how reliable you should consider a cluster, you can do a pretty simple
computation.  If you have 12 x 2T on a single machine and you lose that machine, the remaining
copies of that data must be re-replicated before another disk fails.  With HDFS and block-level
replication, the remaining copies will be spread across the entire cluster, so any disk failure
is reasonably likely to cause data loss.  For a 1000 node cluster with 12000 disks, it is conservative
to estimate a disk failure on average every 2 hours.  Each node will have to replicate about
12GB of data, which will take about 500 seconds, or about 9 or 10 minutes, if you only use 25%
of your network for re-replication.  The probability of a disk failure during a 10 minute
period is 1-exp(-10/120) = 8%.  This means that roughly 1 in 12 full machine failures might
cause data loss.  You can pick whatever you like for the rate at which nodes die, but I don't
think that this is acceptable.
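The computation above checks out numerically; here it is step by step, using the same figures from the post (12000 disks, ~3-year disk MTBF, a 10-minute re-replication window):

```python
import math

disks = 12_000
disk_mtbf_hours = 3 * 365 * 24                  # ~3-year per-disk MTBF assumption
mean_failure_interval_h = disk_mtbf_hours / disks
print(f"cluster-wide disk failure every {mean_failure_interval_h:.1f} h")  # ~2.2 h

window_min = 10                                 # time to re-replicate the lost data
mean_interval_min = 120                         # the post rounds ~2.2 h to 2 h
p = 1 - math.exp(-window_min / mean_interval_min)
print(f"P(disk failure during re-replication) = {p:.3f}")  # ~0.08, i.e. ~1 in 12
```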

My numbers for disk failures are purposely somewhat pessimistic.  If you change the MTBF for
disks to 10 years instead of 3 years, then the probability of data loss after a machine failure
drops, but only to about 2.5%.
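The 2.5% follows from the same formula by scaling the inter-failure interval with the MTBF ratio (a sketch, assuming failures stay exponentially distributed):

```python
import math

window_min = 10
# Scale the 2-hour inter-failure interval by the MTBF ratio (10 years vs 3 years).
mean_interval_min = 120 * (10 / 3)     # ~400 minutes between cluster-wide disk failures
p = 1 - math.exp(-window_min / mean_interval_min)
print(f"p = {p:.3f}")                  # ~0.025, i.e. ~2.5%
```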

Now, I would be the first to say that these numbers feel too high, but I also would rather
not experience enough data loss events to have a reliable gut feel for how often they should
occur.
My feeling is that 2x is fine for data you can reconstruct and which you don't need to read
really fast, but not good enough for data whose loss will get you fired.

On Mon, Nov 7, 2011 at 7:34 PM, Rita <rmorgan466@gmail.com> wrote:
I have been running with 2x replication on a 500tb cluster. No issues whatsoever. 3x is for
the super paranoid.

On Mon, Nov 7, 2011 at 5:06 PM, Ted Dunning <tdunning@maprtech.com> wrote:
Depending on which distribution and what your data center power limits are you may save a
lot of money by going with machines that have 12 x 2 or 3 tb drives.  With suitable engineering
margins and 3 x replication you can have 5 tb net data per node and 20 nodes per rack.  If
you want to go all cowboy with 2x replication and little space to spare then you can double
that density.
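The ~5 TB net figure follows from 12 drives, 3x replication, and an engineering margin; the 0.65 headroom factor below is my assumption, chosen to reproduce Ted's number (headroom covers shuffle/temp space, imbalance, and failure slack):

```python
drives_per_node = 12
drive_tb = 2                        # 2 TB drives (the post also mentions 3 TB)
replication = 3

raw_tb = drives_per_node * drive_tb            # 24 TB raw per node
logical_tb = raw_tb / replication              # 8 TB before headroom
headroom = 0.65                                # assumed engineering margin
net_tb = logical_tb * headroom                 # ~5 TB net, matching the post
print(f"net per node: {net_tb:.1f} TB; per 20-node rack: {20 * net_tb:.0f} TB")
```

Dropping to 2x replication with little spare space roughly doubles the net density, as Ted notes.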

On Monday, November 7, 2011, Rita <rmorgan466@gmail.com> wrote:
> For a 1PB installation you would need close to 170 servers with a 12 TB disk pack installed
on them (with a replication factor of 2). That's a conservative estimate.
> CPUs: 4 cores with 16gb of memory
> Namenode: 4 core with 32gb of memory should be ok.
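Rita's "close to 170 servers" checks out with simple arithmetic (treating 1 PB as 1000 TB; a real deployment would round up further for spares and headroom, presumably why the quoted number is a bit higher):

```python
import math

usable_tb = 1000                    # 1 PB of usable data, taken as 1000 TB
replication = 2
raw_needed_tb = usable_tb * replication        # 2000 TB of raw disk required
tb_per_server = 12                             # 12 TB disk pack per server
servers = math.ceil(raw_needed_tb / tb_per_server)
print(servers)                                 # 167, "close to 170" before spares
```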
> On Fri, Oct 21, 2011 at 5:40 PM, Steve Ed <sedison70@gmail.com> wrote:
>> I am a newbie to Hadoop and trying to understand how to size a Hadoop cluster.
>> What are the factors I should consider when deciding the number of datanodes?
>> Datanode configuration? CPU, memory?
>> Amount of memory required for the namenode?
>> My client is looking at 1 PB of usable data and will be running analytics on TB-size
files using mapreduce.
>> Thanks
>> ….. Steve
> --
> --- Get your facts first, then you can distort them as you please.--
