From: Ted Dunning
Date: Fri, 1 Jul 2011 11:09:33 -0700
Subject: Re: Hadoop Java Versions
To: general@hadoop.apache.org

On Fri, Jul 1, 2011 at 9:12 AM, Steve Loughran wrote:

> On 01/07/2011 01:16, Ted Dunning wrote:
>
>> You have to consider the long-term reliability as well.
>>
>> Losing an entire set of 10 or 12 disks at once makes the overall
>> reliability of a large cluster very suspect. This is because it becomes
>> entirely too likely that two additional drives will fail before the data
>> on the off-line node can be replicated. For 100 nodes, that can decrease
>> the average time to data loss to less than a year.
>
> There's also Rodrigo's work on alternate block placement that doesn't
> scatter blocks quite so randomly across a cluster, so a loss of a node or
> rack doesn't have adverse effects on so many files.
>
> https://issues.apache.org/jira/browse/HDFS-1094

I did calculations based on this as well. The heuristic level of the
computation is pretty simple, but to go any deeper you have a pretty hairy
computation. My own approach was to use Markov chain Monte Carlo to sample
from the failure mode distribution. The code that I wrote for this used
pluggable placement, replication and failure models. I may have lacked
sufficient cleverness at the time, but it was very difficult to come up
with structured placement policies that actually improved the failure
probabilities. Most such strategies massively decreased average survival
probabilities.
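For flavor, here is a toy Monte Carlo sketch of that kind of estimate. It is
not the pluggable-model code described above, just a simplified stand-in:
the cluster size, disk MTBF, re-replication window and block count are all
made-up assumptions, and failures outside the dead node are treated as
independent.

import java.util.Random;

/**
 * Back-of-the-envelope Monte Carlo estimate of data-loss risk when a whole
 * node (and all of its disks) drops out and a few more disks fail before
 * re-replication completes.  Not the code mentioned in the thread; all
 * parameters below are made-up assumptions for illustration only.
 */
public class DataLossSketch {
  static final int NODES = 100;
  static final int DISKS_PER_NODE = 12;
  static final int BLOCKS = 200_000;            // far fewer than a real cluster
  static final double WINDOW_HOURS = 6.0;       // assumed re-replication time
  static final double DISK_MTBF_HOURS = 1.0e6;  // assumed per-disk MTBF
  static final int TRIALS = 1000;

  public static void main(String[] args) {
    Random rnd = new Random(42);
    double pFail = WINDOW_HOURS / DISK_MTBF_HOURS;  // P(one disk dies in window)
    int trialsWithLoss = 0;

    for (int t = 0; t < TRIALS; t++) {
      // Mark dead disks: one entire node plus independent failures elsewhere.
      boolean[] dead = new boolean[NODES * DISKS_PER_NODE];
      int deadNode = rnd.nextInt(NODES);
      for (int i = 0; i < dead.length; i++) {
        dead[i] = (i / DISKS_PER_NODE == deadNode) || rnd.nextDouble() < pFail;
      }

      // Random placement: three replicas on disks of three distinct nodes.
      boolean lost = false;
      for (int b = 0; b < BLOCKS && !lost; b++) {
        int n1 = rnd.nextInt(NODES);
        int n2, n3;
        do { n2 = rnd.nextInt(NODES); } while (n2 == n1);
        do { n3 = rnd.nextInt(NODES); } while (n3 == n1 || n3 == n2);
        lost = dead[n1 * DISKS_PER_NODE + rnd.nextInt(DISKS_PER_NODE)]
            && dead[n2 * DISKS_PER_NODE + rnd.nextInt(DISKS_PER_NODE)]
            && dead[n3 * DISKS_PER_NODE + rnd.nextInt(DISKS_PER_NODE)];
      }
      if (lost) trialsWithLoss++;
    }
    System.out.printf("P(some block loses all replicas | node failure) ~ %.4f%n",
        (double) trialsWithLoss / TRIALS);
  }
}

With realistic block counts (tens of millions rather than 200,000) and with
the correlated failures discussed below, the per-incident probability climbs
quickly, which is the point of the argument above.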
My suspicion, by analogy with large structured error-correcting codes, is
that there are structured placement policies that perform well, but that in
reasonably large clusters (number of disks > 50, say), random placement will
be within epsilon of the best possible strategy with very high probability.

> Given that most HDD failures happen on cluster reboot, it is possible for
> 10-12 disks not to come up at the same time if the cluster has been up for
> a while, but like Todd says -worry. At least a bit.

Indeed. Thank goodness also that disk manufacturers tend to be pessimistic
in quoting MTBF. These possibilities of correlated failure seriously
complicate these computations, of course.

> I've heard hints of one FS that actually includes HDD batch data in block
> placement, to try and scatter data across batches, and be biased towards
> using new HDDs for temp storage during burn-in. Some research work on
> doing that to HDFS could be something to keep some postgraduate busy for
> a while: "Disk batch-aware block placement".

Sadly, I can't comment on my knowledge of this except to say that there are
non-obvious solutions to this embedded in at least one commercial
map-reduce-related product. I can't say which without getting chastised.
(A toy sketch of what batch-aware placement might look like is appended at
the end of this message.)

>> This can only be mitigated in stock hadoop by keeping the number of
>> drives relatively low.
>
> Now I'm confused. Do you mean # of HDDs/server, or HDDs/filesystem?

Per system.

> ..with the largest number of non-RAIDed HDDs out there -things like
> Lustre and IBM GPFS go for RAID, as does HP IBRIX (the last two of which
> have some form of Hadoop support too, if you ask nicely). HDD/server
> numbers matter in that in a small cluster, it's better to have fewer HDDs
> per machine to get more servers to spread the data over; you don't really
> want your 100 TB in three 1U servers. As your cluster grows -and you care
> more about storage capacity than raw compute- the appeal of 24+ TB/server
> starts to look good, and that's when you care about the improvements to
> datanodes handling the loss of a worker disk better. Even without that,
> rebooting the DN may fix things, but the impact on ongoing work is the
> big issue -you don't just lose a replicated block, you lose data.

Generally, I agree with what you say. The effect of RAID is to squeeze the
error distributions so that partial failures have lower probability. This
is complex in the aggregate.

> Cascade failures leading to cluster outages are a separate issue, and
> they are normally triggered by switch failure/config rather than anything
> else. It doesn't matter how reliable the hardware is if it gets the wrong
> configuration.

Indeed.
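On the "disk batch-aware block placement" idea quoted above, here is a rough
standalone sketch of the general shape such a policy could take. It is not
the HDFS BlockPlacementPolicy API; the Disk fields, the burn-in threshold
and the class and method names are all invented for illustration. The idea
is simply to pick replica targets on distinct nodes and distinct
manufacturing batches, preferring drives that are past burn-in.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Illustrative sketch of "disk batch-aware block placement": spread the
 * replicas of a block across different HDD manufacturing batches, and steer
 * durable data away from drives still in their burn-in period.  Not real
 * HDFS code; names and fields are made up.
 */
public class BatchAwarePlacementSketch {

  /** Minimal stand-in for a datanode disk with batch and age metadata. */
  public static class Disk {
    final String node;     // which datanode the disk lives on
    final String batchId;  // manufacturer batch, e.g. from inventory data
    final int ageDays;     // time in service
    Disk(String node, String batchId, int ageDays) {
      this.node = node; this.batchId = batchId; this.ageDays = ageDays;
    }
  }

  static final int BURN_IN_DAYS = 30;  // assumed burn-in period

  /**
   * Pick {@code replicas} disks such that no two share a node or a batch,
   * preferring drives that are past burn-in.  Younger drives are used only
   * if there are not enough seasoned candidates.
   */
  public static List<Disk> chooseTargets(List<Disk> candidates, int replicas) {
    List<Disk> ordered = new ArrayList<>(candidates);
    // Seasoned disks first; the sort is stable, so incoming order is kept
    // within each group.
    ordered.sort((a, b) -> Boolean.compare(a.ageDays < BURN_IN_DAYS,
                                           b.ageDays < BURN_IN_DAYS));
    List<Disk> chosen = new ArrayList<>();
    Set<String> usedNodes = new HashSet<>();
    Set<String> usedBatches = new HashSet<>();
    for (Disk d : ordered) {
      if (chosen.size() == replicas) break;
      if (usedNodes.contains(d.node) || usedBatches.contains(d.batchId)) continue;
      chosen.add(d);
      usedNodes.add(d.node);
      usedBatches.add(d.batchId);
    }
    return chosen;  // may be short if the constraints cannot all be met
  }

  public static void main(String[] args) {
    List<Disk> disks = List.of(
        new Disk("dn1", "batchA", 400),
        new Disk("dn2", "batchA", 350),
        new Disk("dn3", "batchB", 10),   // still burning in
        new Disk("dn4", "batchC", 200),
        new Disk("dn5", "batchB", 500));
    for (Disk d : chooseTargets(disks, 3)) {
      System.out.println(d.node + " (" + d.batchId + ", " + d.ageDays + "d)");
    }
  }
}

A real policy would also have to degrade gracefully when nodes, racks and
batches cannot all be satisfied at once; this sketch simply returns fewer
targets in that case.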