Subject: Re: Multiple dfs.data.dir vs RAID0
From: Michael Katzenellenbogen <michael@cloudera.com>
To: user@hadoop.apache.org
Date: Mon, 11 Feb 2013 11:02:13 -0500

On Mon, Feb 11, 2013 at 10:54 AM, Jean-Marc Spaggiari <jean-marc@spaggiari.org> wrote:

> Thanks all for your feedback.
>
> I have updated the HDFS config to add another dfs.data.dir entry and
> restarted the node. Hadoop is starting to use the new entry, but it is
> not spreading the existing data over the 2 directories.
>
> Let's say you have a 2TB disk on /hadoop1, almost full. If you add
> another 2TB disk on /hadoop2 and add it to dfs.data.dir, Hadoop will
> start to write into both /hadoop1 and /hadoop2, but /hadoop1 will stay
> almost full. It will not balance the already existing data over the 2
> directories.
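(For reference, a minimal sketch of what the two-directory setup looks like in hdfs-site.xml; the property is dfs.data.dir on Hadoop 1.x and dfs.datanode.data.dir on later releases, and the paths here are placeholders:)

  <property>
    <name>dfs.data.dir</name>
    <!-- Comma-separated list of local directories. The DataNode spreads
         new blocks across these volumes, but it does not rebalance
         blocks that are already on disk. -->
    <value>/hadoop1/dfs/data,/hadoop2/dfs/data</value>
  </property>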
> I have deleted all the content of /hadoop1 and /hadoop2 and restarted
> the node, and now the data is spread over the 2. I just need to wait
> for the replication to complete.
>
> So what I will do instead is add 2 x 2TB drives, mount them as RAID0,
> then move the existing data onto this drive and remove the previous
> one. That way Hadoop will still see one directory under /hadoop1, but
> it will be 4TB instead of 2TB...
>
> Is there anywhere I can read about Hadoop and the different kinds of
> physical data storage configurations? (Book, web, etc.)

"Hadoop Operations" by E. Sammer: http://shop.oreilly.com/product/0636920025085.do

> JM
>
> 2013/2/11, Ted Dunning <tdunning@maprtech.com>:
> > Typical best practice is to have a separate file system per spindle.
> > If you have a RAID-only controller (many are), then you just create
> > one RAID per spindle. The effect is the same.
> >
> > MapR is unusual in being able to stripe writes over multiple drives
> > organized into a storage pool, but you will not normally be able to
> > achieve that same level of performance with ordinary Hadoop by using
> > LVM over JBOD or controller-level RAID. The problem is that the Java
> > layer doesn't understand that the storage is striped and the
> > controller doesn't understand what Hadoop is doing. MapR schedules
> > all of the writes to individual spindles via a very fast state
> > machine embedded in the file system.
> >
> > The comment about striping increasing the impact of a single disk
> > drive failure is exactly correct, and it makes modeling the failure
> > modes of the system considerably more complex. The net result of the
> > modeling that I and others have done is that moderate to large RAID
> > groups in storage pools for moderate-sized clusters (< 2000 nodes or
> > so) are just fine. For large clusters of up to 10,000 nodes, you
> > should probably limit RAID groups to 4 drives or less.
> >
> > On Sun, Feb 10, 2013 at 7:39 PM, Marcos Ortiz <mlortiz@uci.cu> wrote:
> >
> >> We have seen in several of our Hadoop clusters that LVM degrades the
> >> performance of our M/R jobs, and I remembered a message where
> >> Ted Dunning was explaining something about this; since that time,
> >> we don't use LVM for Hadoop data directories.
> >>
> >> About RAID volumes, the best performance that we have achieved
> >> is using RAID 10 for our Hadoop data directories.
> >>
> >> On 02/10/2013 09:24 PM, Michael Katzenellenbogen wrote:
> >>
> >> Are you able to create multiple RAID0 volumes? Perhaps you can
> >> expose each disk as its own RAID0 volume...
> >>
> >> Not sure why or where LVM comes into the picture here ... LVM is at
> >> the software layer and (hopefully) the RAID/JBOD stuff is at the
> >> hardware layer (and in the case of HDFS, LVM will only add unneeded
> >> overhead).
> >>
> >> -Michael
> >>
> >> On Feb 10, 2013, at 9:19 PM, Jean-Marc Spaggiari <jean-marc@spaggiari.org> wrote:
> >>
> >> The issue is that my motherboard's controller is not doing JBOD :(
> >> Only RAID is possible, and I have been fighting for the last 48h and
> >> am still not able to make it work... That's why I'm thinking about
> >> using dfs.data.dir instead.
> >>
> >> I have 1 drive per node so far and need to move to 2 to reduce
> >> wait I/O.
> >>
> >> What would be better about JBOD compared to multiple dfs.data.dir
> >> entries? I have done some tests of JBOD vs. LVM and did not find any
> >> pros for JBOD so far.
> >>
> >> JM
> >>
> >> 2013/2/10, Michael Katzenellenbogen <michael@cloudera.com>:
> >>
> >> One thought comes to mind: disk failure.
> >> In the event a disk goes bad, then with RAID0 you just lost your
> >> entire array. With JBOD, you lost one disk.
> >>
> >> -Michael
> >>
> >> On Feb 10, 2013, at 8:58 PM, Jean-Marc Spaggiari <jean-marc@spaggiari.org> wrote:
> >>
> >> Hi,
> >>
> >> I have a quick question regarding RAID0 performance vs. multiple
> >> dfs.data.dir entries.
> >>
> >> Let's say I have 2 x 2TB drives.
> >>
> >> I can configure them as 2 separate drives mounted on 2 folders and
> >> assigned to Hadoop using dfs.data.dir. Or I can mount the 2 drives
> >> with RAID0 and assign them as a single folder to dfs.data.dir.
> >>
> >> With RAID0, the reads and writes are going to be spread over the 2
> >> disks. This significantly increases the speed. But if I put 2
> >> entries in dfs.data.dir, Hadoop is going to spread over those 2
> >> directories too, so in the end the results should be the same, no?
> >>
> >> Any experience/advice/results to share?
> >>
> >> Thanks,
> >>
> >> JM
> >>
> >> --
> >> Marcos Ortiz Valmaseda,
> >> Product Manager && Data Scientist at UCI
> >> Blog: http://marcosluis2186.posterous.com
> >> Twitter: @marcosluis2186
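(For readers weighing the two layouts discussed above, here is a rough sketch of the operating-system side of each option, assuming software RAID via mdadm; the device names /dev/sdb1 and /dev/sdc1 and the mount points are illustrative, and JM's case actually involves a controller-level RAID rather than mdadm:)

  # Option A: keep the disks separate and list both mount points in dfs.data.dir
  mkfs.ext4 /dev/sdb1 && mkdir -p /hadoop1 && mount /dev/sdb1 /hadoop1
  mkfs.ext4 /dev/sdc1 && mkdir -p /hadoop2 && mount /dev/sdc1 /hadoop2

  # Option B: stripe the two disks as RAID0 and expose a single 4TB mount point
  mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb1 /dev/sdc1
  mkfs.ext4 /dev/md0 && mkdir -p /hadoop1 && mount /dev/md0 /hadoop1

With option A, losing one disk takes out a single data directory; with option B, it takes out the whole 4TB array, which is the failure-impact point raised earlier in the thread.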