From: Scott Carey <scott@richrelevance.com>
To: user@hadoop.apache.org
Subject: Re: Optimizing Disk I/O - does HDFS do anything ?
Date: Sat, 17 Nov 2012 07:27:49 +0000
Ext3 can be quite atrocious when it comes to fragmentation. Simply start with an empty drive and have 8 threads each concurrently write to their own large file sequentially; the resulting files will be heavily fragmented.
ext4 is much better in this regard.
xfs is not as good at initial placement, but has an online defragmenter.
ext4 is fastest on a clean system but eventually can get somewhat fragmented and has no defragmentation option.
xfs is slow at metadata operations and I would avoid it for M/R temp for that reason.


I use ext4 for M/R temp, and xfs + online defragmenter for HDFS. The defragmenter runs nightly and has little work to do if run regularly.
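Replicating that layout is just a matter of pointing the M/R temp and DataNode directories at the right mounts. A minimal sketch for Hadoop 1.x configuration, with made-up mount paths:

    <!-- mapred-site.xml: M/R temp on the ext4 mount -->
    <property>
      <name>mapred.local.dir</name>
      <value>/mnt/ext4-scratch/mapred/local</value>
    </property>

    <!-- hdfs-site.xml: DataNode block storage on the xfs mounts -->
    <property>
      <name>dfs.data.dir</name>
      <value>/mnt/xfs-disk1/hdfs/data,/mnt/xfs-disk2/hdfs/data</value>
    </property>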



On 11/13/12 1:10 PM, "Bertrand Dechoux" <dechouxb@gmail.com> wrote:

People are welcome to complement this, but I guess the answer is:
1) Hadoop does not run on Windows. (I am not sure whether Microsoft has made any statement about the OS used for Hadoop on Azure.)
-> http://www.howtogeek.com/115229/htg-explains-why-linux-doesnt-need-defragmenting/
2) Files are written in one go with big blocks. (And actually, file fragmentation is not the only issue. The many-small-files 'issue' is, in the end, a data fragmentation issue too and has an impact on read throughput.)
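To make point 2 concrete: the client chooses the block size when it creates a file and then streams the whole file sequentially, so each block reaches a datanode as one long write. A minimal sketch against the Hadoop FileSystem API (the path, replication factor, and sizes are arbitrary examples):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BigBlockWrite {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Ask for 128 MB blocks for this file; the stream below is
        // append-only, so each block lands as one sequential write.
        long blockSize = 128L * 1024 * 1024;
        FSDataOutputStream out = fs.create(
            new Path("/tmp/bigfile.dat"),   // hypothetical path
            true,                           // overwrite
            conf.getInt("io.file.buffer.size", 4096),
            (short) 3,                      // replication
            blockSize);
        try {
          byte[] buf = new byte[8192];
          for (int i = 0; i < 1024; i++) {
            out.write(buf);                 // 8 MB of sequential writes
          }
        } finally {
          out.close();
        }
      }
    }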
Bertrand Dechoux

On Tue, Nov 13, 2012 at 9:30 PM, Jay Vyas <jayunit100@gmail.com> wrote:
How does HDFS deal with optimization of file streaming? Do data nodes have any optimizations at the disk level for dealing with fragmented files? I assume not, but just curious if this is at all in the works, or if there are java-y ways of dealing with a long-running set of files in an HDFS cluster. Maybe, for example, data nodes could log the amount of time spent on I/O for certain files as a way of reporting whether or not defragmentation needed to be run on a particular node in a cluster.

--
Jay Vyas
http://jayunit100.blogspot.com
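Nothing like that exists in the DataNode as far as I know, but the per-file accounting Jay describes could look roughly like the sketch below; the class name and the 30 MB/s threshold are invented for illustration:

    import java.io.FileInputStream;
    import java.io.IOException;

    // Hypothetical probe: time a full sequential read of a block file
    // and report effective throughput. A real implementation would hook
    // into the DataNode's read path rather than re-reading files.
    public class ReadThroughputProbe {
      public static double megabytesPerSecond(String path) throws IOException {
        byte[] buf = new byte[1 << 20];      // 1 MB read buffer
        long bytes = 0;
        long start = System.nanoTime();
        FileInputStream in = new FileInputStream(path);
        try {
          int n;
          while ((n = in.read(buf)) > 0) {
            bytes += n;
          }
        } finally {
          in.close();
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        return (bytes / (1024.0 * 1024.0)) / seconds;
      }

      public static void main(String[] args) throws IOException {
        double mbps = megabytesPerSecond(args[0]);
        // Invented threshold: flag files that read well below raw disk
        // speed as candidates for defragmentation.
        if (mbps < 30.0) {
          System.out.println(args[0] + " reads at " + mbps
              + " MB/s; may be fragmented");
        }
      }
    }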
