hadoop-common-user mailing list archives

From "David B. Ritch" <david.ri...@gmail.com>
Subject Re: RAID vs. JBOD
Date Mon, 12 Jan 2009 12:35:34 GMT
Thank you - yes, I'm fairly confident that it will work either way.  I'm
trying to find out whether there is an established best practice, and
the performance impact of the decision between RAID 0 and JBOD.
I'll check out the noatime and nodiratime for their effect on our
performance - thanks for that suggestion, as well.
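
For the JBOD layout, a minimal sketch of the dfs.data.dir setting Jason
describes below might look like the following. The mount points /data1,
/data2 and /data3 are hypothetical, one per dedicated HDFS disk, and are
not taken from this thread; substitute the real paths.

```xml
<property>
  <name>dfs.data.dir</name>
  <!-- Assumed mount points, one per physical disk (hypothetical paths). -->
  <value>/data1/dfs/data,/data2/dfs/data,/data3/dfs/data</value>
</property>
```

With a striped (RAID 0) volume the value would instead be a single
directory on the one combined mount.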

Jason Venner wrote:
> If you list your dfs data directories as a comma-separated set of paths,
> you will do fine.
> <property>
>  <name>dfs.data.dir</name>
>  <value>${hadoop.tmp.dir}/dfs/data</value>
>  <description>Determines where on the local filesystem a DFS data node
>  should store its blocks.  If this is a comma-delimited
>  list of directories, then data will be stored in all named
>  directories, typically on different devices.
>  Directories that do not exist are ignored.
>  </description>
> </property>
> The namenode does a lot of small writes, so RAID 1 or RAID 10 is better
> for it. Also, mounting the file systems that hold dfs.data.dir with
> noatime and nodiratime makes a significant performance difference.
> David B. Ritch wrote:
>> How well does Hadoop handle multiple independent disks per node?
>> I have a cluster with 4 identical disks per node.  I plan to use one
>> disk for OS and temporary storage, and dedicate the other three to
>> HDFS.  Our IT folks have some disagreement as to whether the three disks
>> should be striped, or treated by HDFS as three independent disks.  Could
>> someone with more HDFS experience comment on the relative advantages and
>> disadvantages to each approach?
>> Here are some of my thoughts.  It's a bit easier to manage a 3-disk
>> striped partition, and we wouldn't have to worry about balancing files
>> between them.  Single-file I/O should be considerably faster.  On the
>> other hand, I would expect typical use to require multiple file reads
>> or writes simultaneously.  I would expect Hadoop to be able to manage
>> read/write to/from the disks independently.  Managing 3 streams to 3
>> independent devices would likely result in less disk head movement, and
>> therefore better performance.  I would expect Hadoop to be able to
>> balance load between the disks fairly well.  Availability doesn't really
>> differentiate between the two approaches - if a single disk dies, the
>> striped array would go down, but all its data should be replicated on
>> another datanode, anyway.  And besides, I understand that the datanode
>> will shut down even if only one of 3 independent disks crashes.
>> So - anyone want to agree or disagree with these thoughts?  Anyone have
>> any other ideas, or - better - benchmarks and experience with layouts
>> like these two?
>> Thanks!
>> David
