From: Ascot Moss <ascot.moss@gmail.com>
Date: Sun, 5 Jun 2016 16:14:32 +0800
Subject: Re: HDFS2 vs MaprFS
To: daemeon reiydelle <daemeonr@gmail.com>
Cc: Gavin Yue <yue.yuanyuan@gmail.com>, user@hadoop.apache.org

Will the common pool of datanodes and namenode federation in HDFS2 be a
more effective alternative than multiple clusters? (Two short sketches are
appended after the quoted thread below: one of a federated namespace
sharing a datanode pool, and one of how clients obtain block locality from
the namenode.)

On Sun, Jun 5, 2016 at 12:19 PM, daemeon reiydelle <daemeonr@gmail.com> wrote:

> There are indeed many tuning points here. If the namenodes and journal
> nodes can be larger, perhaps even bonding multiple 10 GbE NICs, one can
> easily scale. I did have one client where the file counts forced multiple
> clusters, but we were able to differentiate by airframe type, e.g. fixed
> wing in one, rotary subsonic in another, etc.
>
> sent from my mobile
> Daemeon C.M. Reiydelle
> USA 415.501.0198
> London +44.0.20.8144.9872
>
> On Jun 4, 2016 2:23 PM, "Gavin Yue" <yue.yuanyuan@gmail.com> wrote:
>
>> Here is what I found on the Hortonworks website:
>>
>> *Namespace scalability*
>>
>> While HDFS cluster storage scales horizontally with the addition of
>> datanodes, the namespace does not. Currently the namespace can only be
>> vertically scaled on a single namenode. The namenode stores the entire
>> file system metadata in memory. This limits the number of blocks, files,
>> and directories supported on the file system to what can be accommodated
>> in the memory of a single namenode. A typical large deployment at Yahoo!
>> includes an HDFS cluster with 2700-4200 datanodes with 180 million files
>> and blocks, addressing ~25 PB of storage. At Facebook, HDFS has around
>> 2600 nodes, 300 million files and blocks, addressing up to 60 PB of
>> storage. While these are very large systems and good enough for the
>> majority of Hadoop users, a few deployments that might want to grow even
>> larger could find the namespace scalability limiting.
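For a rough sense of what those figures mean for namenode memory, the sketch
below does the arithmetic, assuming the commonly cited rule of thumb of
roughly 150 bytes of namenode heap per namespace object (file, directory, or
block). The class and the 150-byte constant are illustrative assumptions; the
real per-object cost varies by Hadoop release and by the blocks-per-file
ratio, so treat the output as an order-of-magnitude estimate only.

public class NamenodeHeapEstimate {

    // Assumed rule of thumb: ~150 bytes of namenode heap per namespace
    // object (file, directory, or block). Approximation, not an exact figure.
    static final long BYTES_PER_OBJECT = 150L;

    static double estimateGiB(long namespaceObjects) {
        return (double) (namespaceObjects * BYTES_PER_OBJECT) / (1L << 30);
    }

    public static void main(String[] args) {
        // File-and-block counts quoted in the passage above.
        System.out.printf("Yahoo!   (180M files and blocks): ~%.0f GiB heap%n",
                estimateGiB(180_000_000L));
        System.out.printf("Facebook (300M files and blocks): ~%.0f GiB heap%n",
                estimateGiB(300_000_000L));
    }
}

Roughly 25-45 GiB of heap is still manageable on a single machine, which
matches the passage's point: these deployments work today, but growth much
beyond that range runs into the single-namenode ceiling that federation is
meant to remove.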
>> On Jun 4, 2016, at 04:43, Ascot Moss <ascot.moss@gmail.com> wrote:
>>
>> Hi,
>>
>> I read some (old?) articles on the Internet comparing MapR-FS and HDFS:
>>
>> https://www.mapr.com/products/m5-features/no-namenode-architecture
>>
>> It states that HDFS Federation has:
>>
>> a) "Multiple Single Points of Failure", is it really true?
>> Why does MapR compare against HDFS rather than HDFS2? That makes the
>> comparison unfair, or even misleading: HDFS is from Hadoop 1.x, the old
>> generation, while HDFS2 has been available since 2013-10-15 and no longer
>> has a single point of failure.
>>
>> b) "Limit to 50-200 million files", is it really true?
>> I have seen many real-world Hadoop clusters with over 10 PB of data, some
>> even with 150 PB. If the 50-200 million file limit were true in HDFS2,
>> why are there so many production Hadoop clusters in the real world, and
>> how do they manage that limit? For instance, Facebook's "Like"
>> implementation runs on HBase at web scale; I can imagine HBase generating
>> a huge number of files in Facebook's Hadoop cluster, so the file count
>> there should be far larger than 50-200 million.
>>
>> From my point of view it is the other way around: MapR-FS has a limit of
>> up to 1T (one trillion) files, while HDFS2 with federation can handle an
>> essentially unlimited number of files. Please correct me if I am wrong.
>>
>> c) "Performance Bottleneck", again, is it really true?
>> MapR-FS drops the namenode in order to gain file system performance. But
>> without a namenode, MapR-FS would lose data locality, which is one of the
>> beauties of Hadoop. If data locality is no longer available, a big data
>> application running on MapR-FS might gain some raw file system
>> performance, yet it would lose the much larger performance gain that data
>> locality, provided through Hadoop's namenode, delivers (gain small, lose
>> big).
>>
>> d) "Commercial NAS required"
>> Is there any wiki/blog/discussion about commercial NAS and HDFS
>> Federation?
>>
>> regards
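Back to my first question about federation versus multiple clusters: below is
a minimal client-side sketch of what an HDFS2 federated setup looks like from
an application's point of view. The nameservice IDs ns1 and ns2 and the paths
are hypothetical and assumed to be defined in the cluster's hdfs-site.xml;
this is only an illustration of the model, not a tested configuration.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FederatedNamespaces {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath; the
        // federated nameservices are assumed to be configured there.
        Configuration conf = new Configuration();

        // Two independent namespaces (two namenodes), one shared pool of datanodes.
        FileSystem ns1 = FileSystem.get(URI.create("hdfs://ns1/"), conf);
        FileSystem ns2 = FileSystem.get(URI.create("hdfs://ns2/"), conf);

        // Each namenode manages only its own namespace and block pool, so the
        // metadata load is split across namenodes while the blocks themselves
        // land on the same datanodes.
        ns1.mkdirs(new Path("/projects/fixed-wing"));
        ns2.mkdirs(new Path("/projects/rotary"));

        ns1.close();
        ns2.close();
    }
}

Whether that is more effective than separate clusters still depends on the
kind of operational split daemeon describes, but it does remove the
single-namenode memory ceiling without multiplying physical clusters.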
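And on point c) about data locality: below is a minimal sketch of how a
client (or a scheduler such as MapReduce/YARN) asks the namenode where a
file's blocks live, using the standard FileSystem.getFileBlockLocations API;
that block-location metadata is what makes data-local task placement
possible. The path is hypothetical, and the snippet assumes a reachable HDFS
cluster configured on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocality {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/example.txt");   // hypothetical path
        FileStatus status = fs.getFileStatus(file);

        // The namenode answers with the datanodes holding each block; a
        // scheduler uses these host names to place tasks next to the data.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }

        fs.close();
    }
}

Whether MapR-FS exposes an equivalent locality path through its own metadata
service is exactly the part of the comparison I would like to see discussed.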