hadoop-hdfs-user mailing list archives

From Nathan Rutman <nrut...@gmail.com>
Subject Re: HDFS without Hadoop: Why?
Date Wed, 26 Jan 2011 01:31:25 GMT

On Jan 25, 2011, at 5:08 PM, stu24mail@yahoo.com wrote:

> I don't think, as a recovery strategy, RAID scales to large amounts of data. Even as
> some kind of attached storage device (e.g. Vtrack), you're only talking about a few terabytes
> of data, and it doesn't tolerate node failure.

When talking about large amounts of data, 3x redundancy absolutely doesn't scale.  Nobody
is going to pay for 3 petabytes worth of disk if they only need 1 PB worth of data.  This
is where dedicated high-end RAID systems come in (this is in fact what my company, Xyratex,
builds): redundant controllers, battery backup, etc.  The incremental cost for an additional
drive in such systems is negligible.

> A key part of hdfs is the distributed part.

Granted, single-point-of-failure arguments are valid when concentrating all the storage together,
but they can generally be dealt with using hardware and software failover techniques.

The scale argument in my mind is exactly reversed -- HDFS works fine for smaller installations
that can't afford RAID hardware overhead and access redundancy, and where buying 30 drives
instead of 10 is an acceptable cost for the simplicity of HDFS setup.
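
To put rough numbers on the 30-versus-10 drive comparison, here is a minimal sketch of the arithmetic. It assumes a single RAID 6 group of 10 data drives plus 2 parity drives, which is the geometry implied by the 20% overhead figure quoted below; the function names are made up for illustration and the sketch counts disk space only.

```python
def drives_with_replication(data_drives, replicas=3):
    """Total drives when every block is stored `replicas` times."""
    return data_drives * replicas


def drives_with_raid6(data_drives, parity_drives=2):
    """Total drives for one RAID 6 group: the data drives plus two parity drives."""
    return data_drives + parity_drives


def overhead(total_drives, data_drives):
    """Extra capacity bought, as a fraction of the usable capacity."""
    return (total_drives - data_drives) / data_drives


data = 10  # drives' worth of usable data (assumed for illustration)
for name, total in [("3x replication", drives_with_replication(data)),
                    ("RAID 6 (10+2)", drives_with_raid6(data))]:
    print(f"{name}: {total} drives, {overhead(total, data):.0%} overhead")
```

With these assumptions, replication needs 30 drives (200% overhead) and the RAID 6 group needs 12 (20% overhead). A different group size changes the RAID 6 figure accordingly.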

> Best,
> -stu
> -----Original Message-----
> From: Nathan Rutman <nrutman@gmail.com>
> Date: Tue, 25 Jan 2011 16:32:07 
> To: <hdfs-user@hadoop.apache.org>
> Reply-To: hdfs-user@hadoop.apache.org
> Subject: Re: HDFS without Hadoop: Why?
> On Jan 25, 2011, at 3:56 PM, Gerrit Jansen van Vuuren wrote:
>> Hi,
>> Why would 3x data seem wasteful? 
>> This is exactly what you want.  I would never store any serious business data without
>> some form of replication.
> I agree that you want data backup, but 3x replication is the least efficient / most expensive
> (space-wise) way to do it.  This is what RAID was invented for: RAID 6 gives you fault tolerance
> against loss of any two drives, for only 20% disk space overhead.  (Sorry, I see I forgot
> to note this in my original email, but that's what I had in mind.) RAID is not necessarily
> expensive in dollars either; Linux MD RAID is free and effective.
>> What happens if you store a single file on a single server without replicas and that
>> server goes, or just the disk that the file is on goes? HDFS and any decent distributed
>> file system use replication to prevent data loss. As a side effect, having the same replica
>> of a data piece on separate servers means that more than one task can work on the server in
> Indeed, replicated data does mean Hadoop could work on the same block on separate nodes.
> But outside of Hadoop compute jobs, I don't think this is useful in general.  And in any
> case, a distributed filesystem would let you work on the same block of data from however many
> nodes you wanted.
