From: "Sourygna Luangsay" <sluangsay@pragsis.com>
To: user@hadoop.apache.org
Subject: is HDFS RAID "data locality" efficient?
Date: Wed, 8 Aug 2012 18:46:03 +0200

Hi folks!

I have just read about the HDFS RAID feature that was added to Hadoop 0.21 or 0.22, and I am quite curious to know whether people use it, what kind of use they make of it, and what they think about Map/Reduce data locality.

The first big actor behind this technology is Facebook, which claims to save many PB with it (see http://www.slideshare.net/ydn/hdfs-raid-facebook, slides 4 and 5).

I understand the following advantages of HDFS RAID:
- You save space
- The system tolerates more missing blocks

Nonetheless, one of the drawbacks I see is M/R data locality.

As far as I understand, the advantage of having 3 replicas of each block is not only safety if one server fails or a block is corrupted, but also the possibility of having up to 3 tasktrackers execute the map task with "local data". If you consider the 4th slide of the Facebook presentation, such an infrastructure reduces this possibility to only 1 tasktracker.
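To make the space-saving claim concrete, here is a rough back-of-the-envelope sketch. The stripe lengths and replication settings below are my assumptions based on the commonly cited HDFS RAID configurations (XOR with one parity block per 10-block stripe, and Reed-Solomon (10,4)), not figures taken from the slides:

```python
# Illustrative storage-overhead comparison for HDFS RAID.
# All parameters here are assumptions, not confirmed Facebook settings.

def effective_replication(data_rep, parity_blocks, stripe_len, parity_rep):
    """Bytes stored per byte of user data for a RAIDed stripe:
    data replication plus the amortized cost of the parity blocks."""
    return data_rep + parity_rep * parity_blocks / stripe_len

plain = 3.0                                      # default HDFS replication
xor   = effective_replication(2, 1, 10, 2)       # XOR parity, data kept at 2x
rs    = effective_replication(1, 4, 10, 1)       # Reed-Solomon (10,4), data at 1x

print(f"3x replication: {plain:.1f}x")           # 3.0x
print(f"XOR RAID:       {xor:.1f}x")             # 2.2x
print(f"RS (10,4):      {rs:.1f}x")              # 1.4x
```

Under these assumptions, going from 3.0x to 1.4x is where the multi-PB savings would come from, but it also means most blocks end up with a single replica, which is exactly the locality concern below.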
That means that if this tasktracker is very busy executing other tasks, you have the following choice:
- Wait for this tasktracker to finish executing (part of) its current tasks (freeing map slots, for instance)
- Execute the map task for this block on another tasktracker, transferring the block's data over the network

In both cases, you'll pay an M/R penalty (please tell me if I am wrong).

Has somebody considered such a penalty, or does anyone have benchmarks to share with us?

One scenario I can think of in order to take advantage of HDFS RAID without suffering this penalty is:
- Use normal HDFS with the default replication=3 for my "fresh" data
- Use HDFS RAID for my historical data (which is barely used by M/R)

And you, what are you using HDFS RAID for?

Regards,

Sourygna Luangsay
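The locality argument above can be quantified with a toy model. Assume each node holding a replica is independently "too busy for a local map" with some probability p; then the chance of finding at least one node that can run the task locally is 1 - p^r for r replicas. This is my own simplification (real schedulers also consider rack locality and delay scheduling), not something from the slides:

```python
# Toy model: probability that at least one of the r nodes holding a
# replica has a free map slot, if each node is busy with probability p.
# Independence between nodes is an assumption made for simplicity.

def p_local_map(replicas, p_node_busy):
    return 1 - p_node_busy ** replicas

for r in (3, 1):  # 3 replicas (plain HDFS) vs 1 replica (after RAID)
    print(f"replicas={r}: P(node-local map) = {p_local_map(r, 0.5)}")
```

With p = 0.5, three replicas give a 0.875 chance of a node-local map, while a single replica gives only 0.5, so roughly half of those maps would either wait or read the block over the network, which is the penalty I am asking about.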