From: Peyman Mohajerian
Date: Sun, 5 Jun 2016 12:14:04 -0700
Subject: Re: HDFS2 vs MaprFS
To: Marcin Tustin
Cc: Ascot Moss, user@hadoop.apache.org

It is very common practice to back up the namenode metadata to some SAN store, so a complete loss of all the metadata is preventable. You could lose up to a day's worth of data if, for example, you back up the metadata once a day, but you could do it more frequently. I'm not saying S3 or Azure Blob are bad ideas.
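As a minimal sketch of that kind of metadata backup (the mount points and paths below are illustrative, not taken from this thread), you can list a SAN/NFS-backed directory as an additional namenode metadata directory in hdfs-site.xml, and fetch fsimage checkpoints on a schedule:

  <!-- hdfs-site.xml: mirror the namenode metadata to a second, SAN-backed directory -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///data/namenode,file:///mnt/san/namenode</value>
  </property>

  # e.g. from a daily cron job: download the most recent fsimage checkpoint
  hdfs dfsadmin -fetchImage /mnt/san/fsimage-backups/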
On Sun, Jun 5, 2016 at 8:19 AM, Marcin Tustin <mtustin@handybook.com> wrote:

> The namenode architecture is a source of fragility in HDFS. While a high
> availability deployment (with two namenodes and a failover mechanism) means
> you're unlikely to see a service interruption, it is still possible to have
> a complete loss of filesystem metadata with the loss of two machines.
>
> Secondly, because HDFS identifies datanodes by their hostname/IP, DNS
> changes can cause havoc with HDFS (see my war story on this here:
> https://medium.com/handy-tech/renaming-hdfs-datanodes-considered-terribly-harmful-2bc2f37aabab).
>
> Also, the namenode/datanode architecture probably does contribute to the
> small files problem being a problem. That said, there are lots of practical
> solutions for the small files problem.
>
> If you're just setting up a data infrastructure, I would say consider
> alternatives before you pick HDFS. If you run in AWS, S3 is a good
> alternative. If you run in some other cloud, it's probably worth considering
> whatever their equivalent storage system is.
>
> On Sat, Jun 4, 2016 at 7:43 AM, Ascot Moss <ascot.moss@gmail.com> wrote:
>
>> Hi,
>>
>> I read some (old?) articles from the Internet about MapR-FS vs HDFS:
>>
>> https://www.mapr.com/products/m5-features/no-namenode-architecture
>>
>> It states that HDFS Federation has:
>>
>> a) "Multiple Single Points of Failure" - is it really true?
>> Why does MapR compare against HDFS rather than HDFS2? That makes for an
>> unfair (or even misleading) comparison: HDFS was from Hadoop 1.x, the old
>> generation, while HDFS2 has been available since 2013-10-15 and has no
>> single point of failure.
>>
>> b) "Limit to 50-200 million files" - is it really true?
>> I have seen many real-world Hadoop clusters with over 10PB of data, some
>> even with 150PB. If the "limit to 50-200 million files" were true in HDFS2,
>> why are there so many production Hadoop clusters in the real world, and how
>> do they manage that limit? For instance, Facebook's "Like" implementation
>> runs on HBase at web scale; I can imagine HBase generates a huge number of
>> files in Facebook's Hadoop cluster, so the number of files there should be
>> much, much bigger than 50-200 million.
>>
>> From my point of view, in contrast, MapR-FS should have a true limit of up
>> to 1 trillion (1T) files, while HDFS2 can handle a truly unlimited number
>> of files - please do correct me if I am wrong.
>>
>> c) "Performance Bottleneck" - again, is it really true?
>> MapR-FS does not have a namenode in order to gain file system performance.
>> But without a namenode, MapR-FS would lose data locality, which is one of
>> the beauties of Hadoop. If data locality is no longer available, any big
>> data application running on MapR-FS might gain some file system performance
>> but would lose the much larger performance gain from the data locality
>> provided by Hadoop's namenode (gain small, lose big).
>>
>> d) "Commercial NAS required"
>> Is there any wiki/blog/discussion about commercial NAS on Hadoop
>> Federation?
>>
>> regards
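A side note on the data locality question in (c): HDFS exposes per-block replica locations through the namenode, and that is what locality-aware schedulers consume when placing tasks. A quick, illustrative way to see the block placement for a file (the path below is just an example, not one from this thread) is:

  hdfs fsck /user/hadoop/somefile -files -blocks -locations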