hadoop-user mailing list archives

From Jeffrey Buell <jbu...@vmware.com>
Subject RE: HDFS using SAN
Date Tue, 16 Oct 2012 21:24:36 GMT
It will be difficult to make a SAN work well for Hadoop, but not impossible.  I have done direct
comparisons (but not published them yet).  Direct local storage is likely to have much more
capacity and more total bandwidth.  But you can do pretty well with a SAN if you stuff it
with the highest-capacity disks and provide an independent 8 Gb FC or 10 GbE connection
for every host.  Watch out for overall SAN bandwidth limits (which may well be much less than
the sum of the capacities of the wires connected to it).  There will definitely be a hard limit
to how many hosts you can connect to a single SAN.  Scaling to larger clusters will require multiple
SANs.
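
To make the bandwidth caveat concrete, here is a rough back-of-the-envelope sketch in Python.
It assumes every host reads at once and that the SAN shares its aggregate bandwidth evenly,
which real arrays may not do.  The function names and the numbers (8 Gb/s links, a 40 Gb/s
aggregate limit) are made up for illustration and are not measurements.

```python
def per_host_bandwidth_gbps(num_hosts, link_gbps, san_aggregate_gbps):
    """Estimate usable storage bandwidth per host, in Gb/s.

    Assumption: all hosts are busy at the same time and the SAN divides
    its aggregate bandwidth evenly among them.
    """
    if num_hosts <= 0:
        raise ValueError("num_hosts must be positive")
    fair_share = san_aggregate_gbps / num_hosts
    # A host can never exceed its own link, nor its share of the SAN.
    return min(link_gbps, fair_share)


def max_hosts_at_full_link_speed(link_gbps, san_aggregate_gbps):
    """Largest host count for which every host can still saturate its link."""
    return int(san_aggregate_gbps // link_gbps)


if __name__ == "__main__":
    # Hypothetical figures: 8 Gb/s FC per host, 40 Gb/s SAN aggregate limit.
    for hosts in (4, 16):
        print(hosts, "hosts:", per_host_bandwidth_gbps(hosts, 8, 40), "Gb/s each")
    print("hosts at full link speed:", max_hosts_at_full_link_speed(8, 40))
```

With these made-up figures, a host keeps its full 8 Gb/s up to 5 hosts; at 16 hosts each one
is down to 2.5 Gb/s, however fast its own connection is.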

Locality is an issue.  Even though each host has direct physical access to all the data,
a "remote" access in HDFS will still have to go over the network to the host that owns the
data.  "Local" access is fine with the constraints above.

RAID is not good for Hadoop performance on either local or SAN storage, so you'll want to
configure one LUN for each physical disk in the SAN.  If you do have mirroring or RAID on
the SAN, you may be tempted to use that to replace Hadoop replication.  But while the data
is protected, access to the data is lost if the datanode goes down.  You can get around that
by running the datanode in a VM which is stored on the SAN and using VMware HA to automatically
restart the VM on another host in case of a failure.  Hortonworks has demonstrated this use-case
but this strategy is a bit bleeding-edge.
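
As a rough illustration of the one-LUN-per-disk layout, the Python sketch below builds the
comma-separated directory list a datanode would be given, one directory per mounted LUN.
The mount points (/mnt/lun0, /mnt/lun1, ...) and the helper names are hypothetical.  The
property is dfs.datanode.data.dir in Hadoop 2.x and dfs.data.dir in 1.x.

```python
def datanode_data_dirs(num_luns, mount_prefix="/mnt/lun", subdir="hdfs/data"):
    """Return one data directory per LUN as a comma-separated string.

    Assumption: each LUN is already formatted and mounted at
    <mount_prefix><index>; these paths are illustrative only.
    """
    return ",".join(f"{mount_prefix}{i}/{subdir}" for i in range(num_luns))


def hdfs_site_property(name, value):
    """Render a single <property> element for hdfs-site.xml."""
    return (
        "<property>\n"
        f"  <name>{name}</name>\n"
        f"  <value>{value}</value>\n"
        "</property>"
    )


if __name__ == "__main__":
    # Four LUNs, each backed by one physical disk in the SAN.
    dirs = datanode_data_dirs(4)
    print(hdfs_site_property("dfs.datanode.data.dir", dirs))
```

Giving the datanode a separate directory per LUN lets it spread blocks across the disks
itself, instead of leaving the striping to a RAID layer.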


From: Pamecha, Abhishek [mailto:apamecha@x.com]
Sent: Tuesday, October 16, 2012 11:28 AM
To: user@hadoop.apache.org
Subject: HDFS using SAN


I have read scattered documentation across the net, most of which says HDFS doesn't go well with
a SAN being used to store data, while some says it is an emerging trend. I would love to know
if there have been any tests performed which hint at the aspects in which direct storage excels
or falls behind a SAN.

We are investigating whether a direct storage option is better than SAN storage for a modest
cluster with data in 100 TBs in steady state. The SAN of course can support an order of magnitude
more IOPS than we care about for now, but given it is a shared infrastructure and we may expand
our data size, it may not be an advantage in the future.

Another thing I am interested in: for MR jobs, where data locality is the key driver, how
does that pan out when using a SAN instead of direct storage?

And of course, on the subjective topics of availability and reliability when using a SAN for
data storage in HDFS, I would love to receive your views.

