From: Travis <travis@ghostar.org>
To: user@hadoop.apache.org
Date: Wed, 1 Oct 2014 17:25:16 -0500
Subject: Re: Hadoop and RAID 5

On Wed, Oct 1, 2014 at 4:01 PM, Ulul wrote:

> Dear hadoopers,
>
> Has anyone been confronted with deploying a cluster in a traditional IT
> shop whose admins handle thousands of servers?
> They traditionally use SAN or NAS storage for app data, rely on RAID 1 for
> system disks, and in the few cases where internal disks are used, they
> configure them with RAID 5 provided by the internal HW controller.

Yes. I've been on both sides of this discussion.

The key is to help them understand that you don't need redundancy within a
single system, because Hadoop provides redundancy across the entire cluster
via replication. That leaves it as a performance problem, in which case you
show them benchmarks on the hardware they provide in both RAID (RAID0,
RAID1, and RAID5) and JBOD modes.
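As a very rough illustration (not a real benchmark; the mount points below
are made up, and for an actual comparison you'd want something like fio or
TestDFSIO run against each layout), one sequential writer per data disk gives
a first-order feel for per-spindle streaming throughput:

  # assumes JBOD data disks mounted at /data/1 .. /data/12; adjust to taste
  for d in /data/*; do
      # one streaming writer per spindle, bypassing the page cache
      dd if=/dev/zero of="$d/bench.tmp" bs=1M count=4096 oflag=direct &
  done
  wait
  rm -f /data/*/bench.tmp
  # sum the throughput figures dd reports, then repeat the same run against
  # the RAID0/RAID1/RAID5 volume and compare the aggregate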
> Using a JBOD setup, as advised in each and every Hadoop doc I ever laid
> my hands on, means that each HDD failure will imply, on top of the physical
> replacement of the drive, that an admin performs at least an mkfs.
> Added to the fact that these operations will become more frequent since
> more internal disks will be used, it can be perceived as an annoying
> disruption in industrial handling of numerous servers.

I fail to see how this is really any different from dealing with a failed
drive in an array. Depending on your array type, you may still have to
quiesce the bus before doing any drive operation such as adding or removing
the drive, you may still have to trigger the rebuild yourself, and so on.

I have a few thousand disks in my cluster. We lose about 3-5 a quarter. I
don't find it any more work to re-mkfs a drive after it's been swapped out,
and we have built tools around the process to make sure it's done
consistently by our DC staff (and yes, I did it myself before the DC staff
was asked to). If you're concerned about the high-touch aspect of swapping
disks out, you can always configure the datanode to be tolerant of multiple
disk failures (something you cannot do with RAID5) and then just take the
whole machine out of the cluster to do swaps once you've reached a particular
threshold of bad disks.
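For what it's worth, the per-disk work is small enough to script. A minimal
sketch (device name, label, mount point, and ownership below are illustrative,
and the datanode restart step depends on your distro and management tooling):

  # re-provision a swapped data disk
  mkfs.ext4 -m 0 -L data12 /dev/sdl1   # no reserved blocks on an HDFS data disk
  mount LABEL=data12 /data/12
  chown hdfs:hadoop /data/12
  # then restart the datanode so it picks the volume up again

The failure-tolerance knob mentioned above is
dfs.datanode.failed.volumes.tolerated in hdfs-site.xml: it sets how many data
volumes a datanode may lose before it takes itself offline (the default is 0,
i.e. any single disk failure stops the datanode).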
> In Tom White's guide there is a discussion of RAID 0, stating that Yahoo
> benchmarks showed a 10% loss in performance, so we can expect even worse
> perf with RAID 5, but I found no figures.

I had to re-read that section for reference. My apologies if the following is
a little long-winded and rambling.

I'm going to assume that Tom is not talking about single-disk RAID0 volumes,
which are a common way of doing JBOD with a RAID controller that doesn't have
JBOD support.

In general, performance is going to depend on how many active streams of I/O
you have going on the system.

With JBOD, as Tom discusses, every spindle is its own unique snowflake, and
if your drive controller can keep up, you can write as fast as that drive can
handle off the bus. Performance depends on how many active reading/writing
streams you have accessing each spindle in the system.

If I had one stream, I would only get the performance of one spindle in the
JBOD. If I had twelve spindles, I'm going to get maximum performance with at
least twelve streams. With RAID0, you're taking your one stream, cutting it
up into multiple parts and either reading it from or writing it to all disks,
taking advantage of the performance of all spindles.

The problem arises when you start adding more streams in parallel to the
RAID0 environment. Each parallel I/O operation begins competing with the
others from the controller's standpoint. Things start to stack up as the
controller has to wait for competing I/O operations on a single spindle. For
example, it may have to wait for a write to complete before a read can be
done.

At a certain point, the performance of RAID0 hits a knee as the number of I/O
requests goes up, because the controller becomes the bottleneck. RAID0 is
going to be the closest to JBOD in performance, but with the risk that if you
lose a single disk, you lose the entire RAID. With JBOD, if you lose a single
disk, you only lose the data on that disk.

Now, with RAID5 you're going to have even worse performance, because you're
dealing not only with the parity calculation but also with the performance
penalty you incur during reads and writes due to how the data is laid out
across all disks in the RAID. You can read more about this here:
http://theithollow.com/2012/03/understanding-raid-penalty/

To put this in perspective, I use 12 7200rpm NL-SAS disks in a system
connected to an LSI 9207 SAS controller, configured for JBOD. I have
benchmarked streaming reads and writes in this environment at between 1.6 and
1.8 GBytes/sec using 1 I/O stream per spindle, for a total of 12 I/O streams
on the system. Btw, this benchmark has held stable at that rate with at least
3 I/O streams per spindle; I haven't tested higher yet.

Now, I might get this performance with RAID0, but why should I tolerate the
risk of losing all data on the system versus just the data on a single drive?
Going with RAID0 means that not only do I have to replace the disk, but now I
have to have Hadoop rebalance/redistribute data for the entire system instead
of just dealing with the small amount of data missing from one spindle. And
since Hadoop is already handling my redundancy via replication of data, why
should I tolerate the performance penalty associated with RAID5? I don't need
redundancy in a *single* system, I need redundancy across the entire cluster.

> I also found a Hortonworks interview of StackIQ, who provide software to
> automate such failure fix-up. But it would be rather painful to go straight
> to another solution, contract and so on while starting with Hadoop.
>
> Please share your experiences around RAID for redundancy (1, 5 or other)
> in Hadoop conf.

I can't see any situation in which we would use RAID for the data drives in
our Hadoop cluster. We only use RAID1 for the OS drives, simply because we
want to reduce the recovery period associated with a system failure. No
reason to re-install a system and have to replicate data back onto it if we
don't have to.

Cheers,
Travis

--
Travis Campbell
travis@ghostar.org