hadoop-mapreduce-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject Re: is HDFS RAID "data locality" efficient?
Date Wed, 08 Aug 2012 17:06:04 GMT
Just something to think about... 

There's a company here in Chicago called Cleversafe. I believe they recently made an announcement
concerning Hadoop? 

The interesting thing about RAID is that you're adding to the disk latency and, depending on
which RAID level you use, you could kill performance during a rebuild of a disk.

In terms of uptime of Apache-based Hadoop, RAID allows you to actually hot swap the disks,
and unless you lose both drives (assuming RAID 1, mirroring), your DN doesn't know and doesn't
have to go down.
So there is some value there, but at the expense of storage capacity and storage cost.

You can reduce the replication factor to 2. I don't know that I would go to anything lower
because you still can lose the server... 

In terms of data locality... maybe you lose a bit, however... because you're raiding your
storage, you now have less data per node. So you end up with more nodes, right? 
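
A rough back-of-the-envelope sketch of that trade-off. The dataset size and the
per-node raw capacity below are made-up numbers, and the helper name is
hypothetical; the only point is that mirroring each disk (RAID 1) halves the
usable space per node, so even with replication dropped to 2 you need more raw
disk than plain disks with replication 3.

```python
import math

def nodes_needed(dataset_tb, raw_tb_per_node, replication, raid_copies=1):
    """Nodes required to hold dataset_tb of user data.

    raid_copies is the number of physical copies the local RAID keeps
    (1 = plain disks, 2 = RAID 1 mirroring). All inputs are illustrative.
    """
    total_raw_tb = dataset_tb * replication * raid_copies
    return math.ceil(total_raw_tb / raw_tb_per_node)

# Assumed example: 100 TB of data, 12 TB of raw disk per node.
plain_rep3 = nodes_needed(100, 12, replication=3, raid_copies=1)
raid1_rep2 = nodes_needed(100, 12, replication=2, raid_copies=2)
print(plain_rep3, raid1_rep2)  # prints "25 34"
```

With these assumed numbers, RAID 1 plus replication 2 stores four physical
copies of every block instead of three, hence the extra nodes.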

Just some food for thought. 

On Aug 8, 2012, at 11:46 AM, Sourygna Luangsay <sluangsay@pragsis.com> wrote:

> Hi folks!
> I have just read about the HDFS RAID feature that was added to Hadoop 0.21 or 0.22, and
> I am quite curious to know whether people use it, what kind of use
> they have for it, and what they think about Map/Reduce data locality.
> The first big user of this technology is Facebook, which claims to save many PB with it (see
> http://www.slideshare.net/ydn/hdfs-raid-facebook slides 4 and 5).
> I understand the following advantages with HDFS RAID:
> - You can save space
> - The system tolerates more missing blocks
> Nonetheless, one of the drawbacks I see is M/R data locality.
> As far as I understand, the advantage of having 3 replicas of each block is not only
> safety if one server fails or a block is corrupted,
> but also the possibility of having up to 3 tasktrackers that can execute the map task with
> “local data”.
> If you consider the 4th slide of the Facebook presentation, such an infrastructure reduces
> this possibility to only 1 tasktracker.
> That means that if this tasktracker is very busy executing other tasks, you have the
> following choices:
> - Waiting for this tasktracker to finish executing (part of) its current tasks (freeing
>   map slots, for instance)
> - Executing the map task for this block on another tasktracker, transferring
>   the contents of the block over the network
> In both cases, you'll get an M/R penalty (please tell me if I am wrong).
> Has anybody considered this penalty, or does anybody have benchmarks to share with us?
> One scenario I can think of to take advantage of HDFS RAID without suffering
> this penalty is:
> - Using normal HDFS with the default replication=3 for my “fresh data”
> - Using HDFS RAID for my historical data (that is barely used by M/R)
> And you, what are you using HDFS RAID for?
> Regards,
> Sourygna Luangsay
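
The locality concern in the quoted message can be illustrated with a toy model.
This is only a sketch under a strong assumption that is not from the thread:
each tasktracker holding a replica is busy (no free map slot) independently
with the same probability. Real schedulers, rack-level locality, and waiting
for a slot are ignored, and the function name is hypothetical.

```python
def p_local(replicas, p_busy):
    """Chance that at least one replica-holding tasktracker has a free slot.

    Assumes each such tasktracker is busy independently with probability
    p_busy (an illustrative simplification, not a measured value).
    """
    return 1.0 - p_busy ** replicas

# Assumed load: each tasktracker is busy 80% of the time.
for replicas in (3, 1):
    print(replicas, f"{p_local(replicas, 0.8):.3f}")
# prints:
# 3 0.488
# 1 0.200
```

Under this assumed load, going from 3 replicas to 1 cuts the chance of a
data-local map task from about 49% to 20%, which is the penalty being asked
about; actual numbers would have to come from benchmarks.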