The reliability question can be answered to first order by computing
replication time for a unit of storage and then computing how often that
replication time will contain additional failures sufficient to cause data
loss. Such data loss events should be roughly Poisson distributed with rate
equal to the rate of the original failures times the probability that any
failure actually is a data loss. Second order effects appear when one
replication spills into the next increasing the replication period for the
second event. It is difficult to impossible to account for all of the
second order effects in closed form and I have found it necessary to resort
to discrete event simulation to estimate failure mode probabilities in
detail. For small numbers of disks per node, one second order effect that
becomes important is the node failure rate.
Grouping disks into storage groups or failing an entire node when one disk
fails are ways that the storage units are larger than individual disks. Use
of a volume manager or RAID0 will increase the storage unit size.
These failure modes drive some limitations on cluster size since the
absolute rate of storage unit failures increases with cluster size. For a
fixed number of drives in each storage unit, the limiting factor is the
total number of disk drives, not the number of nodes. For older versions of
Hadoop, the storage unit was all drives on the system which is quite
dangerous in terms of mean time to data loss. More recently, a fix has been
committed to trunk (and I think .204, Todd will correct me if I am wrong)
that makes the storage unit equal to a single drive. In the previous
situation, it was dangerous to have too many drives on each node in large
clusters. With single disk storage units, the number of drives per machine
does not matter in this computation.
To be specific, taking a 100 node x 10 disk x 2 TB configuration with drive
MTBF of 1000 days, we should be seeing drive failures on average once per
day. With 1G ethernet and 30MB/s/node dedicated to rereplication, it will
just over 10 minutes to restore replication of a single drive and will take
just over 100 minutes to restore replication of an entire machine. The
probability of 2 disk failures during the 15 minutes after a failure is
roughly \lambda^2 e^\lambda / 2 where \lambda = 15 minutes / 24 hours.
This is a small probability so average times between data loss should be
relatively long. For the larger storage unit of 10 disks, the probability
is not so small and data loss should be expected every few years or so.
For a 10,000 node cluster, however, we should expect the average rate of
disk failure rate of one failure every 2.5 hours. Here, the number of disks
is large enough that the first order computation is much less accurate since
the placement of disk blocks across the cluster will often have more
nonuniformity due to small counts. This nonuniformity increases the
replication recovery time. With the large storage unit model, the
probability that three disk failures will stack up becomes unacceptably
large. Even with the single disk storage unit, the data loss rate becomes
large enough that the cluster cannot be considered archival.
The real question about optimal configuration depends on how fast the
cluster can move data from disk. If this rate is relatively low compared to
the hardware speeds, then supporting full performance from large numbers of
drives is very difficult. If you can maintain high transfer rates, however,
you can substantially decrease the cost of your cluster by having fewer
nodes.
On Wed, Aug 10, 2011 at 7:56 AM, Evert Lammerts <Evert.Lammerts@sara.nl>wrote:
> A short, slightly offtopic question:
>
> > Also note that in this configuration that one cannot take
> > advantage of the "keep the machine up at all costs" features in newer
> > Hadoop's, which require that root, swap, and the log area be mirrored
> > to be truly effective. I'm not quite convinced that those features are
> > worth it yet for anything smaller than maybe a 12 disk config.
>
> Dell and Cloudera promote the C2100. I'd like to see the calculations
> behind that config. Am I wrong thinking that keeping your cluster up with
> such dense nodes will only work if you have many (order of magnitude 100+)
> of them, and interconnected with 10Gb Ethernet? If you don't then recovery
> times from failing disks / rack switches are going to get crazy, right? If
> you want to get bang for buck, don't the proportions "disk IO / processing
> power", "node storage capacity / ethernet speed" and "total amount of nodes
> / ethernet speed", indicate many small nodes with not too many disks and 1Gb
> Ethernet?
>
> Cheers,
> Evert
>
