Subject: Re: Multiple dfs.data.dir vs RAID0
From: Michael Katzenellenbogen <michael@cloudera.com>
To: user@hadoop.apache.org
Date: Mon, 11 Feb 2013 11:02:13 -0500

On Mon, Feb 11, 2013 at 10:54 AM, Jean-Marc Spaggiari <jean-marc@spaggiari.org> wrote:

> Thanks all for your feedback.
>
> I have updated the HDFS config to add another dfs.data.dir entry and
> restarted the node. Hadoop is starting to use the new entry, but it is
> not spreading the existing data over the 2 directories.
>
> Let's say you have a 2TB disk on /hadoop1, almost full. If you add
> another 2TB disk on /hadoop2 and add it to dfs.data.dir, Hadoop will
> start to write into both /hadoop1 and /hadoop2, but /hadoop1 will stay
> almost full. It will not balance the already existing data over the 2
> directories.
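(For reference, a minimal sketch of what the two-directory setup looks like in hdfs-site.xml; the property is dfs.data.dir on Hadoop 1.x and dfs.datanode.data.dir on later releases, and the paths here are placeholders:)

  <property>
    <name>dfs.data.dir</name>
    <!-- Comma-separated list of local directories. The DataNode spreads
         new blocks across these volumes, but it does not rebalance
         blocks that are already on disk. -->
    <value>/hadoop1/dfs/data,/hadoop2/dfs/data</value>
  </property>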
> I have deleted all the content of /hadoop1 and /hadoop2 and restarted
> the node, and now the data is spread over the 2. I just need to wait
> for the replication to complete.
>
> So what I will do instead is add 2 x 2TB drives, mount them as RAID0,
> then move the existing data onto this drive and remove the previous
> one. That way Hadoop will still see one directory under /hadoop1, but
> it will be 4TB instead of 2TB...
>
> Is there anywhere I can read about Hadoop and the different kinds of
> physical data storage configurations? (Book, web, etc.)

"Hadoop Operations" by E. Sammer: http://shop.oreilly.com/product/0636920025085.do

> JM
>
> 2013/2/11, Ted Dunning <tdunning@maprtech.com>:
> > Typical best practice is to have a separate file system per spindle.
> > If you have a RAID-only controller (many are), then you just create
> > one RAID per spindle. The effect is the same.
> >
> > MapR is unusual in being able to stripe writes over multiple drives
> > organized into a storage pool, but you will not normally be able to
> > achieve that same level of performance with ordinary Hadoop by using
> > LVM over JBOD or controller-level RAID. The problem is that the Java
> > layer doesn't understand that the storage is striped and the
> > controller doesn't understand what Hadoop is doing. MapR schedules
> > all of the writes to individual spindles via a very fast state
> > machine embedded in the file system.
> >
> > The comment about striping increasing the impact of a single disk
> > drive failure is exactly correct, and it makes modeling the failure
> > modes of the system considerably more complex. The net result of the
> > modeling that I and others have done is that moderate to large RAID
> > groups in storage pools for moderate-sized clusters (< 2000 nodes or
> > so) are just fine. For large clusters of up to 10,000 nodes, you
> > should probably limit RAID groups to 4 drives or less.
> >
> > On Sun, Feb 10, 2013 at 7:39 PM, Marcos Ortiz <mlortiz@uci.cu> wrote:
> >
> >> We have seen in several of our Hadoop clusters that LVM degrades the
> >> performance of our M/R jobs, and I remembered a message where
> >> Ted Dunning was explaining something about this; since that time,
> >> we don't use LVM for Hadoop data directories.
> >>
> >> About RAID volumes, the best performance that we have achieved
> >> is using RAID 10 for our Hadoop data directories.
> >>
> >> On 02/10/2013 09:24 PM, Michael Katzenellenbogen wrote:
> >>
> >> Are you able to create multiple RAID0 volumes? Perhaps you can
> >> expose each disk as its own RAID0 volume...
> >>
> >> Not sure why or where LVM comes into the picture here ... LVM is at
> >> the software layer and (hopefully) the RAID/JBOD stuff is at the
> >> hardware layer (and in the case of HDFS, LVM will only add unneeded
> >> overhead).
> >>
> >> -Michael
> >>
> >> On Feb 10, 2013, at 9:19 PM, Jean-Marc Spaggiari <jean-marc@spaggiari.org> wrote:
> >>
> >> The issue is that my motherboard's controller is not doing JBOD :(
> >> Only RAID is possible, and I have been fighting for the last 48h and
> >> am still not able to make it work... That's why I'm thinking about
> >> using dfs.data.dir instead.
> >>
> >> I have 1 drive per node so far and need to move to 2 to reduce
> >> wait I/O.
> >>
> >> What would be better about JBOD compared to multiple dfs.data.dir
> >> entries? I have done some tests of JBOD vs. LVM and did not find any
> >> pros for JBOD so far.
> >>
> >> JM
> >>
> >> 2013/2/10, Michael Katzenellenbogen <michael@cloudera.com>:
> >>
> >> One thought comes to mind: disk failure.
> >> In the event a disk goes bad, then with RAID0 you just lost your
> >> entire array. With JBOD, you lost one disk.
> >>
> >> -Michael
> >>
> >> On Feb 10, 2013, at 8:58 PM, Jean-Marc Spaggiari <jean-marc@spaggiari.org> wrote:
> >>
> >> Hi,
> >>
> >> I have a quick question regarding RAID0 performance vs. multiple
> >> dfs.data.dir entries.
> >>
> >> Let's say I have 2 x 2TB drives.
> >>
> >> I can configure them as 2 separate drives mounted on 2 folders and
> >> assigned to Hadoop using dfs.data.dir. Or I can mount the 2 drives
> >> with RAID0 and assign them as a single folder to dfs.data.dir.
> >>
> >> With RAID0, the reads and writes are going to be spread over the 2
> >> disks. This significantly increases the speed. But if I put 2
> >> entries in dfs.data.dir, Hadoop is going to spread over those 2
> >> directories too, so in the end the results should be the same, no?
> >>
> >> Any experience/advice/results to share?
> >>
> >> Thanks,
> >>
> >> JM
> >>
> >> --
> >> Marcos Ortiz Valmaseda,
> >> Product Manager && Data Scientist at UCI
> >> Blog: http://marcosluis2186.posterous.com
> >> Twitter: @marcosluis2186
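(For readers weighing the two layouts discussed above, here is a rough sketch of the operating-system side of each option, assuming software RAID via mdadm; the device names /dev/sdb1 and /dev/sdc1 and the mount points are illustrative, and JM's case actually involves a controller-level RAID rather than mdadm:)

  # Option A: keep the disks separate and list both mount points in dfs.data.dir
  mkfs.ext4 /dev/sdb1 && mkdir -p /hadoop1 && mount /dev/sdb1 /hadoop1
  mkfs.ext4 /dev/sdc1 && mkdir -p /hadoop2 && mount /dev/sdc1 /hadoop2

  # Option B: stripe the two disks as RAID0 and expose a single 4TB mount point
  mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb1 /dev/sdc1
  mkfs.ext4 /dev/md0 && mkdir -p /hadoop1 && mount /dev/md0 /hadoop1

With option A, losing one disk takes out a single data directory; with option B, it takes out the whole 4TB array, which is the failure-impact point raised earlier in the thread.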