From: Ted Dunning
Date: Fri, 1 Jul 2011 11:09:33 -0700
Subject: Re: Hadoop Java Versions
To: general@hadoop.apache.org

On Fri, Jul 1, 2011 at 9:12 AM, Steve Loughran wrote:

> On 01/07/2011 01:16, Ted Dunning wrote:
>
>> You have to consider the long-term reliability as well.
>>
>> Losing an entire set of 10 or 12 disks at once makes the overall
>> reliability of a large cluster very suspect. This is because it becomes
>> entirely too likely that two additional drives will fail before the data
>> on the off-line node can be replicated. For 100 nodes, that can decrease
>> the average time to data loss to less than a year.
>
> There's also Rodrigo's work on alternate block placement that doesn't
> scatter blocks quite so randomly across a cluster, so a loss of a node or
> rack doesn't have adverse effects on so many files.
>
> https://issues.apache.org/jira/browse/HDFS-1094

I did calculations based on this as well. The heuristic level of the
computation is pretty simple, but to go any deeper you have a pretty hairy
computation. My own approach was to use Markov chain Monte Carlo to sample
from the failure mode distribution. The code that I wrote for this used
pluggable placement, replication and failure models. I may have lacked
sufficient cleverness at the time, but it was very difficult to come up
with structured placement policies that actually improved the failure
probabilities. Most such strategies massively decreased average survival
probabilities.
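For flavor, here is a toy Monte Carlo sketch of that kind of estimate. It is
not the pluggable-model code described above, just a simplified stand-in:
the cluster size, disk MTBF, re-replication window and block count are all
made-up assumptions, and failures outside the dead node are treated as
independent.

import java.util.Random;

/**
 * Back-of-the-envelope Monte Carlo estimate of data-loss risk when a whole
 * node (and all of its disks) drops out and a few more disks fail before
 * re-replication completes.  Not the code mentioned in the thread; all
 * parameters below are made-up assumptions for illustration only.
 */
public class DataLossSketch {
  static final int NODES = 100;
  static final int DISKS_PER_NODE = 12;
  static final int BLOCKS = 200_000;            // far fewer than a real cluster
  static final double WINDOW_HOURS = 6.0;       // assumed re-replication time
  static final double DISK_MTBF_HOURS = 1.0e6;  // assumed per-disk MTBF
  static final int TRIALS = 1000;

  public static void main(String[] args) {
    Random rnd = new Random(42);
    double pFail = WINDOW_HOURS / DISK_MTBF_HOURS;  // P(one disk dies in window)
    int trialsWithLoss = 0;

    for (int t = 0; t < TRIALS; t++) {
      // Mark dead disks: one entire node plus independent failures elsewhere.
      boolean[] dead = new boolean[NODES * DISKS_PER_NODE];
      int deadNode = rnd.nextInt(NODES);
      for (int i = 0; i < dead.length; i++) {
        dead[i] = (i / DISKS_PER_NODE == deadNode) || rnd.nextDouble() < pFail;
      }

      // Random placement: three replicas on disks of three distinct nodes.
      boolean lost = false;
      for (int b = 0; b < BLOCKS && !lost; b++) {
        int n1 = rnd.nextInt(NODES);
        int n2, n3;
        do { n2 = rnd.nextInt(NODES); } while (n2 == n1);
        do { n3 = rnd.nextInt(NODES); } while (n3 == n1 || n3 == n2);
        lost = dead[n1 * DISKS_PER_NODE + rnd.nextInt(DISKS_PER_NODE)]
            && dead[n2 * DISKS_PER_NODE + rnd.nextInt(DISKS_PER_NODE)]
            && dead[n3 * DISKS_PER_NODE + rnd.nextInt(DISKS_PER_NODE)];
      }
      if (lost) trialsWithLoss++;
    }
    System.out.printf("P(some block loses all replicas | node failure) ~ %.4f%n",
        (double) trialsWithLoss / TRIALS);
  }
}

With realistic block counts (tens of millions rather than 200,000) and with
the correlated failures discussed below, the per-incident probability climbs
quickly, which is the point of the argument above.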
My suspicion, by analogy with large structured error-correcting codes, is
that there are structured placement policies that perform well, but that in
reasonably large clusters (number of disks > 50, say), random placement will
be within epsilon of the best possible strategy with very high probability.

> Given that most HDD failures happen on cluster reboot, it is possible for
> 10-12 disks not to come up at the same time if the cluster has been up for
> a while, but like Todd says -worry. At least a bit.

Indeed. Thank goodness also that disk manufacturers tend to be pessimistic
in quoting MTBF. These possibilities of correlated failure seriously
complicate these computations, of course.

> I've heard hints of one FS that actually includes HDD batch data in block
> placement, to try and scatter data across batches, and be biased towards
> using new HDDs for temp storage during burn-in. Some research work on
> doing that to HDFS could be something to keep some postgraduate busy for
> a while: "Disk batch-aware block placement".

Sadly, I can't comment on my knowledge of this except to say that there are
non-obvious solutions to this embedded in at least one commercial
map-reduce-related product. I can't say which without getting chastised.
(A toy sketch of what batch-aware placement might look like is appended at
the end of this message.)

>> This can only be mitigated in stock hadoop by keeping the number of
>> drives relatively low.
>
> Now I'm confused. Do you mean # of HDDs/server, or HDDs/filesystem?

Per system.

> ..with the largest number of non-RAIDed HDDs out there -things like
> Lustre and IBM GPFS go for RAID, as does HP IBRIX (the last two of which
> have some form of Hadoop support too, if you ask nicely). HDD/server
> numbers matter in that in a small cluster, it's better to have fewer HDDs
> per machine to get more servers to spread the data over; you don't really
> want your 100 TB in three 1U servers. As your cluster grows -and you care
> more about storage capacity than raw compute- the appeal of 24+ TB/server
> starts to look good, and that's when you care about the improvements to
> datanodes handling the loss of a worker disk better. Even without that,
> rebooting the DN may fix things, but the impact on ongoing work is the
> big issue -you don't just lose a replicated block, you lose data.

Generally, I agree with what you say. The effect of RAID is to squeeze the
error distributions so that partial failures have lower probability. This
is complex in the aggregate.

> Cascade failures leading to cluster outages are a separate issue, and
> they are normally triggered by switch failure/config rather than anything
> else. It doesn't matter how reliable the hardware is if it gets the wrong
> configuration.

Indeed.
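On the "disk batch-aware block placement" idea quoted above, here is a rough
standalone sketch of the general shape such a policy could take. It is not
the HDFS BlockPlacementPolicy API; the Disk fields, the burn-in threshold
and the class and method names are all invented for illustration. The idea
is simply to pick replica targets on distinct nodes and distinct
manufacturing batches, preferring drives that are past burn-in.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Illustrative sketch of "disk batch-aware block placement": spread the
 * replicas of a block across different HDD manufacturing batches, and steer
 * durable data away from drives still in their burn-in period.  Not real
 * HDFS code; names and fields are made up.
 */
public class BatchAwarePlacementSketch {

  /** Minimal stand-in for a datanode disk with batch and age metadata. */
  public static class Disk {
    final String node;     // which datanode the disk lives on
    final String batchId;  // manufacturer batch, e.g. from inventory data
    final int ageDays;     // time in service
    Disk(String node, String batchId, int ageDays) {
      this.node = node; this.batchId = batchId; this.ageDays = ageDays;
    }
  }

  static final int BURN_IN_DAYS = 30;  // assumed burn-in period

  /**
   * Pick {@code replicas} disks such that no two share a node or a batch,
   * preferring drives that are past burn-in.  Younger drives are used only
   * if there are not enough seasoned candidates.
   */
  public static List<Disk> chooseTargets(List<Disk> candidates, int replicas) {
    List<Disk> ordered = new ArrayList<>(candidates);
    // Seasoned disks first; the sort is stable, so incoming order is kept
    // within each group.
    ordered.sort((a, b) -> Boolean.compare(a.ageDays < BURN_IN_DAYS,
                                           b.ageDays < BURN_IN_DAYS));
    List<Disk> chosen = new ArrayList<>();
    Set<String> usedNodes = new HashSet<>();
    Set<String> usedBatches = new HashSet<>();
    for (Disk d : ordered) {
      if (chosen.size() == replicas) break;
      if (usedNodes.contains(d.node) || usedBatches.contains(d.batchId)) continue;
      chosen.add(d);
      usedNodes.add(d.node);
      usedBatches.add(d.batchId);
    }
    return chosen;  // may be short if the constraints cannot all be met
  }

  public static void main(String[] args) {
    List<Disk> disks = List.of(
        new Disk("dn1", "batchA", 400),
        new Disk("dn2", "batchA", 350),
        new Disk("dn3", "batchB", 10),   // still burning in
        new Disk("dn4", "batchC", 200),
        new Disk("dn5", "batchB", 500));
    for (Disk d : chooseTargets(disks, 3)) {
      System.out.println(d.node + " (" + d.batchId + ", " + d.ageDays + "d)");
    }
  }
}

A real policy would also have to degrade gracefully when nodes, racks and
batches cannot all be satisfied at once; this sketch simply returns fewer
targets in that case.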