From: Ajit Ratnaparkhi <ajit.ratnaparkhi@gmail.com>
Date: Thu, 9 Aug 2012 00:01:37 +0530
Subject: Re: is HDFS RAID "data locality" efficient?
To: user@hadoop.apache.org

Agreed with Steve. That is the most important use of HDFS RAID: you consume less disk space with the same reliability and availability guarantees, at the cost of processing performance. Most of the data in HDFS is cold data. Without HDFS RAID you end up maintaining 3 replicas of data that will hardly ever be processed again, yet you can't remove or move that data to a separate archive, because when processing is required it should start as soon as possible.
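To put rough numbers on that space saving, here is a back-of-the-envelope sketch. The stripe length and replication factors are assumptions taken from commonly quoted HDFS RAID configurations (XOR with a stripe of 10, Reed-Solomon (10,4)), not figures from this thread:

    /**
     * Effective copies stored per source byte:
     *   sourceRep + parityRep * parityLen / stripeLen
     */
    public class RaidOverhead {
        static double overhead(int sourceRep, int parityRep,
                               int parityLen, int stripeLen) {
            return sourceRep + (double) parityRep * parityLen / stripeLen;
        }

        public static void main(String[] args) {
            // Plain HDFS: three full copies, no parity.
            System.out.printf("plain replication   : %.1fx%n", overhead(3, 0, 0, 1));
            // XOR raid: 2 copies of data, 1 parity block per 10-block
            // stripe, parity itself replicated twice.
            System.out.printf("XOR raid            : %.1fx%n", overhead(2, 2, 1, 10));
            // Reed-Solomon (10,4): single data copy, 4 parity blocks per stripe.
            System.out.printf("Reed-Solomon (10,4) : %.1fx%n", overhead(1, 1, 4, 10));
        }
    }

That prints 3.0x, 2.2x and 1.4x: with RS parity the cold data takes less than half the space of 3x replication while a lost block can still be reconstructed from the rest of its stripe. A sketch of the replication side of Sourygna's fresh/historical split follows the quoted thread below.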
-Ajit

On Wed, Aug 8, 2012 at 11:01 PM, Steve Loughran <stevel@hortonworks.com> wrote:

> On 8 August 2012 09:46, Sourygna Luangsay <sluangsay@pragsis.com> wrote:
>
>> Hi folks!
>>
>> One of the scenarios I can think of in order to take advantage of HDFS
>> RAID without suffering this penalty is:
>>
>> - Using normal HDFS with the default replication=3 for my "fresh data"
>> - Using HDFS RAID for my historical data (which is barely used by M/R)
>>
> exactly: less space used on cold data, with the penalty that access
> performance can be worse. As the majority of data on a Hadoop cluster is
> usually "cold", it's a space- and power-efficient story for the archive
> data.
>
> --
> Steve Loughran
> Hortonworks Inc
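For the fresh-vs-historical split quoted above, the step HDFS RAID performs once parity is in place is lowering the replication factor of the raided files. The RaidNode does this itself; the snippet below is only a hand-rolled approximation using the public FileSystem API, and the archive path and target factor are hypothetical:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ArchiveRep {
        public static void main(String[] args) throws IOException {
            FileSystem fs = FileSystem.get(new Configuration());

            Path archive = new Path("/data/historical"); // hypothetical cold-data root
            short coldRep = 2;                           // e.g. XOR-raided files keep 2 copies

            // Drop replication on every over-replicated file under the archive root.
            for (FileStatus st : fs.listStatus(archive)) {
                if (!st.isDir() && st.getReplication() > coldRep) {
                    fs.setReplication(st.getPath(), coldRep);
                }
            }
        }
    }

Fresh data keeps the default replication=3 (dfs.replication), so MapReduce over it still gets the usual three locality choices per block; only the cold paths pay the reduced-locality price.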