hadoop-general mailing list archives

From Steve Loughran <ste...@apache.org>
Subject Re: Hadoop Java Versions
Date Fri, 01 Jul 2011 16:12:44 GMT
On 01/07/2011 01:16, Ted Dunning wrote:
> You have to consider the long-term reliability as well.
> Losing an entire set of 10 or 12 disks at once makes the overall reliability
> of a large cluster very suspect.  This is because it becomes entirely too
> likely that two additional drives will fail before the data on the off-line
> node can be replicated.  For 100 nodes, that can decrease the average time
> to data loss down to less than a year.
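(Ted's point can be sketched with a back-of-envelope estimate. All the numbers below are my own assumptions -12 x 2 TB disks, 128 MB blocks, a 4% annual disk failure rate, a one-hour re-replication window- not anything from this thread. With 3x replication, losing a whole node loses one replica of every block on it; data loss needs the other two replicas to die before re-replication finishes.)

```python
# Hedged sketch: expected number of 3x-replicated blocks lost when a node
# with 12 disks goes offline, assuming the two surviving replicas fail
# independently within the re-replication window. Numbers are assumptions.

def p_block_lost(p_disk_fails_in_window: float) -> float:
    """Probability both surviving replicas fail within the repair window."""
    return p_disk_fails_in_window ** 2

def expected_lost_blocks(blocks_on_node: int, p_disk: float) -> float:
    return blocks_on_node * p_block_lost(p_disk)

blocks = 12 * 2_000_000 // 128          # ~187500 block replicas on the node
p_disk_hour = 0.04 / (365 * 24)         # 4%/year -> per-disk prob in 1 hour
print(expected_lost_blocks(blocks, p_disk_hour))
```

Per incident the expected loss is tiny, but multiply by the node-failure rate across 100 nodes and years of uptime and the accumulated risk is why the mean time to data loss gets uncomfortable.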

There's also Rodrigo's work on alternate block placement that doesn't 
scatter blocks quite so randomly across a cluster, so the loss of a node 
or rack doesn't have adverse effects on so many files.
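(The intuition behind that kind of non-random placement can be sketched like this -a toy of my own, not Rodrigo's actual code or any HDFS API: partition the nodes into small fixed replica groups, and keep all three replicas of a block inside one group, so a dead node can only touch blocks whose group contains it.)

```python
import random

# Toy sketch of group-constrained placement: replicas of any block all
# land inside one predefined group, limiting how many files a single
# node loss can touch. Names and structure are mine, not HDFS's.

def make_groups(nodes, group_size=3):
    nodes = list(nodes)
    random.shuffle(nodes)
    return [nodes[i:i + group_size] for i in range(0, len(nodes), group_size)]

def place_block(groups):
    return random.choice(groups)        # all three replicas in one group

groups = make_groups(range(12), group_size=3)
print(place_block(groups))              # three nodes from a single group
```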


Given that most HDD failures happen on cluster reboot, it is quite 
possible for 10-12 disks not to come back up at the same time if the 
cluster has been up for a while, but like Todd says -worry. At least a bit.

I've heard hints of one FS that actually includes HDD batch data in 
block placement, to try and scatter data across batches, and to be biased 
towards using new HDDs for temp storage during burn-in. Some research 
work on doing that to HDFS could be something to keep a postgraduate 
busy for a while: "Disk batch-aware block placement".
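(A minimal sketch of what such a batch-aware chooser might look like -entirely hypothetical, the function and the batch data are made up for illustration: never place two replicas of a block on disks from the same manufacturing batch.)

```python
import random

# Hypothetical batch-aware replica chooser: pick k disks such that no two
# share a manufacturing batch. disk_batches maps disk id -> batch id.

def choose_replicas(disk_batches, k=3):
    chosen, used_batches = [], set()
    disks = list(disk_batches)
    random.shuffle(disks)
    for d in disks:
        if disk_batches[d] not in used_batches:
            chosen.append(d)
            used_batches.add(disk_batches[d])
        if len(chosen) == k:
            return chosen
    raise RuntimeError("not enough distinct batches for %d replicas" % k)

# 12 disks spread over 4 (invented) batches:
disks = {f"disk{i}": f"batch{i % 4}" for i in range(12)}
print(choose_replicas(disks))
```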

> This can only be mitigated in stock
> hadoop by keeping the number of drives relatively low.

Now I'm confused. Do you mean # of HDDs/server, or HDDs/filesystem? 
Because it seems to me that "stock" HDFS's use in production makes it 
one of the filesystems on the planet with the largest number of 
non-RAIDed HDDs out there -things like Lustre and IBM GPFS go for RAID, 
as does HP IBRIX (the last two of which have some form of Hadoop support 
too, if you ask nicely). HDD/server numbers matter in that in a small 
cluster, it's better to have fewer HDDs per server and more servers to 
spread the data over; you don't really want your 100 TB in three 1U 
servers. As your cluster grows -and you care more about storage capacity 
than raw compute- the appeal of 24+ TB/server starts to look good, and 
that's when you care about the improvements to datanodes handling the 
loss of a worker disk better. Even without that, rebooting the DN may 
fix things, but the impact on ongoing work is the big issue -you don't 
just lose a replicated block, you lose data.
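(Some rough arithmetic -mine, with assumed numbers, not anything measured- on why dense nodes make a node loss expensive: the entire node's data has to be re-replicated across the survivors, bounded by their aggregate bandwidth.)

```python
# Hedged sketch: hours to re-replicate a dead node's data, assuming each
# surviving peer can absorb ~100 MB/s of replication traffic. All inputs
# are illustrative assumptions.

def rereplication_hours(node_tb, peers, mb_per_sec_per_peer=100):
    total_mb = node_tb * 1_000_000
    aggregate = peers * mb_per_sec_per_peer     # cluster-wide MB/s
    return total_mb / aggregate / 3600

print(round(rereplication_hours(24, peers=99), 2))   # -> 0.67
```

Under these assumptions a 24 TB node on a 100-node cluster re-replicates in well under an hour; shrink the cluster to a handful of dense servers and the same loss keeps the network saturated far longer, which is exactly the window in which a second failure hurts.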

Cascade failures leading to cluster outages are a separate issue, and 
normally triggered by switch failure/misconfiguration more than anything 
else. It doesn't matter how reliable the hardware is if it gets the wrong 
