hadoop-hdfs-user mailing list archives

From John Buchanan <John.Bucha...@infinitecampus.com>
Subject HDFS drive, partition best practice
Date Mon, 07 Feb 2011 20:25:09 GMT

My company will be building a small but quickly growing Hadoop deployment, and I had a question
regarding best practice for configuring the storage for the datanodes.  Cloudera has a page
where they recommend a JBOD configuration over RAID.  My question, though, is whether they
are referring to the simplest definition of JBOD, that being literally just a collection of
heterogeneous drives, each with its own distinct partition and mount point?  Or are they referring
to a concatenated span of heterogeneous drives presented to the OS as a single device?

Through some digging I've discovered that data volumes may be specified in a comma-delimited
fashion in the hdfs-site.xml file and are then accessed individually, but are of course all
available within the pool.  To test this I brought an Ubuntu Server 10.04 VM online (on Xen
Cloud Platform) with 3 storage volumes.  The first holds the OS; I created a single partition
on each of the second and third, mounted them as /hadoop-datastore/a and /hadoop-datastore/b
respectively, and specified them in hdfs-site.xml in comma-delimited fashion.  I then went on
to construct a single-node pseudo-distributed install and executed the bin/start-all.sh script,
and all seems just great.  The volumes are 5GB each, and the HDFS status page shows a configured
capacity of 9.84GB, so both are in use.  I also successfully added a file using bin/hadoop dfs -put.
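
For illustration, a comma-delimited data directory entry of the kind described above might
look roughly like the fragment below.  This is only a sketch: the property name dfs.data.dir
is an assumption (it is the name used in Hadoop 0.20-era releases; the message does not say
which property was set), and the two paths are the mount points from the test VM.

```xml
<!-- hdfs-site.xml (fragment): sketch only, property name assumed -->
<configuration>
  <property>
    <!-- Assumed property: datanode block storage directories -->
    <name>dfs.data.dir</name>
    <!-- One directory per independently mounted volume, comma-delimited, no spaces -->
    <value>/hadoop-datastore/a,/hadoop-datastore/b</value>
  </property>
</configuration>
```

Each listed directory sits on its own volume, so the datanode reports the combined capacity
of all of them.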

This led me to think that perhaps an optimal datanode configuration would be 2 drives in
RAID 1 for the OS, then 2-4 additional drives for data, individually partitioned, mounted, and
configured in hdfs-site.xml.  Mirrored system drives would make my node more robust, but the data
drives would still be independent.  I do realize that HDFS assures data redundancy at a higher
level by design, but having the loss of a single drive necessitate rebuilding an entire node,
and therefore being down in capacity during that period, just doesn't seem to be the most
efficient approach.

Would love to hear what others are doing in this regard, whether anyone is using concatenated
disks and whether the loss of a drive requires them to rebuild the entire system.

John Buchanan
