hadoop-mapreduce-user mailing list archives

From Scott Carey <sc...@richrelevance.com>
Subject Re: Optimizing Disk I/O - does HDFS do anything ?
Date Sat, 17 Nov 2012 07:27:49 GMT
Ext3 can be quite atrocious when it comes to fragmentation.  To see it, simply start with an empty drive
and have 8 threads each concurrently write to their own large file sequentially.
ext4 is much better in this regard.
xfs is not as good at initial placement, but has an online defragmenter.
ext4 is fastest on a clean system but eventually can get somewhat fragmented and has no defragmentation
option.
xfs is slow at meta-data operations and I would avoid it for M/R temp for that reason.
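
The concurrent-writer experiment described above could be reproduced along the following lines. This is only a minimal sketch in Python, not something from the original message: the function names, file names and default sizes are illustrative assumptions, and `count_extents` assumes the `filefrag` utility (shipped with e2fsprogs) is installed and that the filesystem supports the extent-mapping query it relies on.

```python
import os
import re
import subprocess
import sys
import threading


def write_file(path, total_bytes, chunk_bytes):
    """Write total_bytes to path sequentially, chunk_bytes at a time."""
    chunk = b"\0" * chunk_bytes
    written = 0
    with open(path, "wb") as f:
        while written < total_bytes:
            n = min(chunk_bytes, total_bytes - written)
            f.write(chunk[:n])
            written += n
        f.flush()
        os.fsync(f.fileno())  # make sure the blocks are actually allocated on disk


def write_concurrently(directory, writers=8, total_bytes=1 << 30, chunk_bytes=1 << 20):
    """Start one thread per file; each thread writes its own file sequentially."""
    paths = [os.path.join(directory, f"writer-{i}.dat") for i in range(writers)]
    threads = [
        threading.Thread(target=write_file, args=(p, total_bytes, chunk_bytes))
        for p in paths
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return paths


def parse_extent_count(filefrag_output):
    """Extract N from a line such as '/data/f.dat: N extents found'."""
    match = re.search(r"(\d+) extents? found", filefrag_output)
    return int(match.group(1)) if match else None


def count_extents(path):
    """Ask filefrag how many extents the file occupies (assumes filefrag exists)."""
    result = subprocess.run(
        ["filefrag", path], capture_output=True, text=True, check=True
    )
    return parse_extent_count(result.stdout)


if __name__ == "__main__":
    # Usage: python frag_test.py /mnt/empty-drive
    for p in write_concurrently(sys.argv[1]):
        print(p, count_extents(p))
```

A higher extent count for the same file size means more fragmentation, so running this on freshly formatted ext3, ext4 and xfs volumes gives a rough basis for comparison. The defaults write 8 GiB in total; pass smaller values for a quick trial.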


I use ext4 for M/R temp, and xfs + online defragmenter for HDFS.  The defragmenter runs nightly
and has little work to do if run regularly.



On 11/13/12 1:10 PM, "Bertrand Dechoux" <dechouxb@gmail.com> wrote:

People are welcome to add to this, but I guess the answer is:
1) Hadoop does not run on Windows. (I am not sure whether Microsoft has made any statement about the
OS used for Hadoop on Azure.)
-> http://www.howtogeek.com/115229/htg-explains-why-linux-doesnt-need-defragmenting/
2) Files are written in one go, in big blocks. (And actually, file fragmentation is
not the only issue. The many-small-files 'issue' is, in the end, a data fragmentation issue
too, and it has an impact on read throughput.)

Bertrand Dechoux

On Tue, Nov 13, 2012 at 9:30 PM, Jay Vyas <jayunit100@gmail.com> wrote:
How does HDFS deal with optimization of file streaming?  Do data nodes have any optimizations
at the disk level for dealing with fragmented files?  I assume not, but I'm just curious whether this
is at all in the works, or whether there are java-y ways of dealing with a long-running set of
files in an HDFS cluster.  Maybe, for example, data nodes could log the amount of time spent
on I/O for certain files as a way of reporting whether or not defragmentation needs to be
run on a particular node in a cluster.
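
The idea of logging I/O time per file could look roughly like the sketch below. It is purely hypothetical and is not an existing HDFS or data node feature: the class name `ReadThroughputLog`, its methods, the baseline figure and the 0.5 threshold are all assumptions made for illustration. It accumulates bytes read and seconds spent per file, then reports files whose average read throughput falls well below an expected baseline.

```python
import time


class ReadThroughputLog:
    """Hypothetical per-file read log; not part of any Hadoop API."""

    def __init__(self):
        self._bytes = {}
        self._seconds = {}

    def record(self, path, n_bytes, seconds):
        """Accumulate one observed read of n_bytes that took `seconds`."""
        self._bytes[path] = self._bytes.get(path, 0) + n_bytes
        self._seconds[path] = self._seconds.get(path, 0.0) + seconds

    def timed_read(self, path, chunk_bytes=1 << 20):
        """Read a whole file sequentially and record how long it took."""
        start = time.monotonic()
        total = 0
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_bytes)
                if not chunk:
                    break
                total += len(chunk)
        self.record(path, total, time.monotonic() - start)
        return total

    def throughput(self, path):
        """Average bytes per second for a file, or None if nothing was recorded."""
        seconds = self._seconds.get(path, 0.0)
        if seconds <= 0:
            return None
        return self._bytes[path] / seconds

    def slow_files(self, baseline_bytes_per_sec, threshold=0.5):
        """Files whose throughput is below threshold * baseline (assumed cutoff)."""
        slow = []
        for path in self._bytes:
            rate = self.throughput(path)
            if rate is not None and rate < threshold * baseline_bytes_per_sec:
                slow.append(path)
        return sorted(slow)
```

Slow reads are only a hint. Page-cache hits, disk contention from other tasks, or a failing drive would skew these numbers just as much as fragmentation, so a flagged file would still need to be checked with a tool that reports its actual extent layout.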

--
Jay Vyas
http://jayunit100.blogspot.com

