hadoop-hdfs-user mailing list archives

From Allen Wittenauer <awittena...@linkedin.com>
Subject Re: TestDFSIO on Lustre vs HDFS
Date Sat, 29 Jan 2011 06:35:21 GMT

On Jan 28, 2011, at 10:39 AM, Nathan Rutman wrote:
> Your storage type should depend on the kind of data you're storing, the quantity, the reliability,
scalability, heterogenicity (sic), data access pattern, applications you're using, performance
requirements, and system cost.   My point in posting this stuff is not to say that Lustre should
be your choice for a Hadoop backend in all situations.  It was really to show that HDFS was
designed for a particular usage pattern and scale, and using it outside of that realm may
not be the best choice.  I was looking to the HDFS community to poke holes in my arguments.

	People who approach HDFS from a pure filesystem perspective are often disappointed because
they miss the fact that it is written primarily to support Hadoop's MapReduce framework.
In particular, this means having access to data locality information so that the network
hit is mostly immaterial when reading or writing.   It is going to make a huge difference
if you are reading a single TB file from one node for processing (which in turn will likely
require many, many block fetches from across the network) vs. being able to distribute that
read to multiple hosts (such that there is little-to-no network activity at all).
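
	To put rough numbers on that difference, here is a back-of-the-envelope sketch in Python. It is not taken from any benchmark: the function names, the 100-node cluster, the uniform replica placement, and the 95% locality hit rate are all assumptions made up for illustration. Only the 3x replication factor is the usual HDFS default.

```python
def remote_bytes_single_reader(file_bytes, nodes, replication=3):
    """Bytes fetched over the network when one node reads the whole file.

    Assumption: replicas are spread uniformly at random, so the reader
    holds a replica of any given block with probability replication / nodes.
    """
    local_fraction = min(1.0, replication / nodes)
    return file_bytes * (1.0 - local_fraction)


def remote_bytes_distributed(file_bytes, locality_hit_rate=1.0):
    """Bytes fetched over the network when each block is read by a task
    placed on a node that already stores it.

    Assumption: locality_hit_rate is the fraction of blocks the scheduler
    manages to read locally; the real value depends on cluster load.
    """
    return file_bytes * (1.0 - locality_hit_rate)


TB = 10**12  # one terabyte, decimal

single = remote_bytes_single_reader(TB, nodes=100)
spread = remote_bytes_distributed(TB, locality_hit_rate=0.95)

print(f"single reader: {single / TB:.2f} TB over the network")
print(f"distributed:   {spread / TB:.2f} TB over the network")
```

	Under these assumptions the single reader pulls nearly the whole file across the network, while the distributed read moves only the blocks that missed their local replica.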

>  Also, to get improved Hadoop performance, the network needs to be more expensive than

	Hardly, especially when trunking is thrown into the mix.  

> And Lustre requires more sysadmin care and understanding, which adds to total cost of
> But all of that is a "fixed" cost -- it does not scale linearly with your storage size.
If you double your storage requirement, you'll pay ~1.2x for RAID parity and spare space with
Lustre, but you'll pay 3x for HDFS disks.  The Lustre initial costs are higher.  So at some
scale there will necessarily be a cost crossover.

	As nodes are added, the network costs will also go up, regardless of setup.  The only time
they don't is if the original design had significantly overprovisioned the network relative to node count.
Using only 8 nodes hides this fact.
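
	A minimal sketch of the crossover argument, with a per-TB network term added to both sides. The 1.2x and 3x multipliers come from the quoted text; everything else (the function name, the dollar figures, and the idea of folding network cost into a flat per-TB rate) is a hypothetical simplification, not measured data.

```python
def crossover_tb(fixed_extra, disk_cost_per_tb,
                 lustre_overhead=1.2, hdfs_replication=3.0,
                 lustre_net_per_tb=0.0, hdfs_net_per_tb=0.0):
    """Usable capacity (TB) above which Lustre becomes cheaper than HDFS.

    fixed_extra       -- extra up-front Lustre cost (hypothetical figure)
    disk_cost_per_tb  -- raw disk cost per TB (hypothetical figure)
    *_net_per_tb      -- network cost that grows with capacity (assumption)

    Returns None when Lustre's per-TB cost is not lower, i.e. the
    fixed cost is never recovered and there is no crossover.
    """
    lustre_slope = lustre_overhead * disk_cost_per_tb + lustre_net_per_tb
    hdfs_slope = hdfs_replication * disk_cost_per_tb + hdfs_net_per_tb
    saving_per_tb = hdfs_slope - lustre_slope
    if saving_per_tb <= 0:
        return None
    return fixed_extra / saving_per_tb


# Disk cost only: the crossover always exists.
print(crossover_tb(fixed_extra=90_000, disk_cost_per_tb=100))

# A network premium on the Lustre side pushes the crossover out...
print(crossover_tb(90_000, 100, lustre_net_per_tb=80))

# ...and a large enough premium removes it entirely.
print(crossover_tb(90_000, 100, lustre_net_per_tb=200))
```

	The point of the sketch is that the crossover is only "necessary" if the per-TB network cost is ignored or is the same for both setups.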
