hadoop-common-user mailing list archives

From Jason Venner <ja...@attributor.com>
Subject Re: RAID vs. JBOD
Date Sun, 11 Jan 2009 22:55:22 GMT
If you set dfs.data.dir to a comma-separated list of directories, one on
each disk, you will do fine.

<property>
  <name>dfs.data.dir</name>
  <value>${hadoop.tmp.dir}/dfs/data</value>
  <description>Determines where on the local filesystem an DFS data node
  should store its blocks.  If this is a comma-delimited
  list of directories, then data will be stored in all named
  directories, typically on different devices.
  Directories that do not exist are ignored.
  </description>
</property>
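
For example, with three data disks, an override in your site configuration
might look roughly like the sketch below. The /disk1, /disk2 and /disk3
mount points are hypothetical; substitute whatever paths your disks are
actually mounted on.

<property>
  <name>dfs.data.dir</name>
  <!-- hypothetical mount points, one directory per physical disk -->
  <value>/disk1/dfs/data,/disk2/dfs/data,/disk3/dfs/data</value>
</property>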

The namenode does a lot of small writes, so RAID 1 or RAID 10 is better for it.

Also, mounting the file systems that hold the dfs.data.dir directories with
noatime and nodiratime makes a significant performance difference.

David B. Ritch wrote:
> How well does Hadoop handle multiple independent disks per node?
>
> I have a cluster with 4 identical disks per node.  I plan to use one
> disk for OS and temporary storage, and dedicate the other three to
> HDFS.  Our IT folks have some disagreement as to whether the three disks
> should be striped, or treated by HDFS as three independent disks.  Could
> someone with more HDFS experience comment on the relative advantages and
> disadvantages to each approach?
>
> Here are some of my thoughts.  It's a bit easier to manage a 3-disk
> striped partition, and we wouldn't have to worry about balancing files
> between them.  Single-file I/O should be considerably faster.  On the
> other hand, I would expect typical use to require multiple file reads
> or writes simultaneously.  I would expect Hadoop to be able to manage
> read/write to/from the disks independently.  Managing 3 streams to 3
> independent devices would likely result in less disk head movement, and
> therefore better performance.  I would expect Hadoop to be able to
> balance load between the disks fairly well.  Availability doesn't really
> differentiate between the two approaches - if a single disk dies, the
> striped array would go down, but all its data should be replicated on
> another datanode, anyway.  And besides, I understand that datanode will
> shut down a node, even if only one of 3 independent disks crashes.
>
> So - anyone want to agree or disagree with these thoughts?  Anyone have
> any other ideas, or - better - benchmarks and experience with layouts
> like these two?
>
> Thanks!
>
> David
>   
