From: Gaurav Sharma <gaurav.gs.sharma@gmail.com>
Date: Wed, 2 Feb 2011 21:31:07 -0500
Subject: Re: HDFS without Hadoop: Why?
To: hdfs-user@hadoop.apache.org

Stuart - if Dhruba is giving the per-file and per-block sizes used by the HDFS namenode, you really cannot get a more authoritative number elsewhere :) I would do the back-of-envelope with ~160 bytes/file and ~150 bytes/block.
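
For concreteness, a minimal sketch of that back-of-envelope in Java, assuming only the rough ~160 bytes/file and ~150 bytes/block figures quoted in this thread (real namenode heap usage varies by version and JVM, so treat this as an order-of-magnitude estimate, not a sizing tool):

// Back-of-envelope namenode memory estimate using the rough per-object
// costs quoted in this thread (~160 bytes/file, ~150 bytes/block).
public class NamenodeMemoryEstimate {

    static final long BYTES_PER_FILE = 160;   // rough per-file metadata cost
    static final long BYTES_PER_BLOCK = 150;  // rough per-block metadata cost

    /** Estimated namenode heap (bytes) consumed by file/block metadata alone. */
    static long estimateHeapBytes(long files, long blocks) {
        return files * BYTES_PER_FILE + blocks * BYTES_PER_BLOCK;
    }

    public static void main(String[] args) {
        // The Yahoo example quoted below: 100 million files, 200 million blocks.
        long files = 100000000L;
        long blocks = 200000000L;
        double gb = estimateHeapBytes(files, blocks) / (1024.0 * 1024.0 * 1024.0);
        // Prints roughly 43 GB of raw metadata; the "at least 60 GB" guidance
        // below leaves headroom for everything else on the namenode heap.
        System.out.printf("~%.0f GB of namenode heap for metadata alone%n", gb);
    }
}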

On Wed, Feb 2, 2011 at 9:08 PM, Stuart Smith <stu24mail@yahoo.com> wrote:

This is the best coverage I've seen from a source that would know:

http://developer.yahoo.com/blogs/hadoop/posts/2010/05/scalability_of_the_hadoop_dist/

One relevant quote:

To store 100 million files (referencing 200 million blocks), a name-node should have at least 60 GB of RAM.

But, honestly, if you're just building out your cluster, you'll probably run into a lot of other limits first: hard drive space, regionserver memory, the infamous ulimit/xciever :), etc.

Take care,
  -stu

--- On Wed, 2/2/11, Dhruba Borthakur <dhruba@gmail.com> wrote:

From: Dhruba Borthakur <dhruba@gmail.com>

Subject: Re: HDFS without Hadoop: Why?
Date: Wednesday, February 2, 2011, 9:00 PM

The Namenode uses around 160 bytes/file and 150 bytes/block in HDFS. This is a very rough calculation.

dhruba

On Wed, Feb 2, 2011 at 5:11 PM, Dhodapkar, Chinmay <chinmayd@qualcomm.com> wrote:

What you describe is pretty much my use case as well. Since I don't know how big the number of files could get, I am trying to figure out if there is a theoretical design limitation in hdfs…

From what I have read, the name node will store all metadata of all files in RAM. Assuming (in my case) that a file is less than the configured block size… there should be a very rough formula that can be used to calculate the max number of files that hdfs can serve based on the configured RAM on the name node?

Can any of the implementers comment on this? Am I even thinking on the right track…?

Thanks Ian for the haystack link… very informative indeed.

-Chinmay
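
(As a rough illustration of exactly that kind of formula, using the per-object costs given earlier in the thread; this is a sketch only, since a real namenode heap also needs room for everything else:)

// Rough upper bound on file count for a given metadata heap budget, assuming
// ~160 bytes per file plus ~150 bytes per block (the thread's rough figures).
public class MaxFilesEstimate {
    static long roughMaxFiles(long heapBytesForMetadata, double avgBlocksPerFile) {
        double bytesPerFile = 160 + 150 * avgBlocksPerFile;
        return (long) (heapBytesForMetadata / bytesPerFile);
    }

    public static void main(String[] args) {
        // Example: files smaller than one block (1 block/file) and, say, 30 GB
        // of heap budgeted for metadata -> on the order of 100 million files.
        long thirtyGb = 30L * 1024 * 1024 * 1024;
        System.out.println(roughMaxFiles(thirtyGb, 1.0));  // ~100 million
    }
}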


From: Stuart Smith [mailto:stu24mail@yahoo.com]
Sent: Wednesday, February 02, 2011 4:41 PM

Subject: RE: HDFS without Hadoop: Why?


Hello,
I'm actually using hbase/hadoop/hdfs for lots of small files (with a long tail of larger files). Well, millions of small files - I don't know what you mean by lots :)

Facebook probably knows better, but what I do is:

  - store metadata in hbase
  - files smaller than 10 MB or so in hbase
  - larger files in an hdfs directory tree.

I started storing 64 MB files and smaller in hbase (chunk size), but that causes issues with regionservers when running M/R jobs. This is related to the fact that I'm running a cobbled-together cluster & my region servers don't have that much memory. I would play with the size to see what works for you.
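
(A minimal sketch of that size-threshold routing, using the old-style HBase client API; the table name "files", the column families, the /bigfiles layout, and the 10 MB cutoff are made-up examples, and error handling is omitted:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

/** Sketch: small payloads go into an HBase cell next to their metadata;
 *  large payloads go to an HDFS file whose path is recorded in HBase. */
public class SmallFileRouter {
    // Hypothetical threshold and schema -- tune to your regionserver memory.
    static final int SMALL_FILE_LIMIT = 10 * 1024 * 1024;  // ~10 MB
    static final byte[] META = Bytes.toBytes("meta");       // metadata family
    static final byte[] DATA = Bytes.toBytes("data");       // inline-content family

    public static void store(String key, byte[] content) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "files");
        Put put = new Put(Bytes.toBytes(key));
        put.add(META, Bytes.toBytes("size"), Bytes.toBytes((long) content.length));

        if (content.length <= SMALL_FILE_LIMIT) {
            // Small file: keep the bytes in HBase alongside the metadata.
            put.add(DATA, Bytes.toBytes("content"), content);
        } else {
            // Large file: write it into an HDFS directory tree, store only the path.
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/bigfiles/" + key);
            FSDataOutputStream out = fs.create(path);
            out.write(content);
            out.close();
            put.add(META, Bytes.toBytes("hdfsPath"), Bytes.toBytes(path.toString()));
        }
        table.put(put);
        table.close();
    }
}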

Take care,
  -stu

--- On Wed, 2/2/11, Dhodapkar, Chinmay <chinmayd@qualcomm.com> wrote:


From: Dhodapkar, Chinmay <chinmayd@qualcomm.com>
Subject: RE: HDFS without Hadoop: Why?
To: "hdfs-user@hadoop.apache.org" <hdfs-user@hadoop.apache.org>
Date: Wednesday, February 2, 2011, 7:28 PM

Hello,

I have been following this thread for some time now. I am very comfortable with the advantages of hdfs, but still have lingering questions about the usage of hdfs for general purpose storage (no mapreduce/hbase etc).

Can somebody shed light on what the limitations are on the number of files that can be stored. Is it limited in any way by the namenode? The use case I am interested in is to store a very large number of relatively small files (1 MB to 25 MB).

Interestingly, I saw a Facebook presentation on how they use hbase/hdfs internally. They seem to store all metadata in hbase and the actual images/files/etc in something called "haystack" (why not use hdfs since they already have it?). Anybody know what "haystack" is?

Thanks!

Chinmay


From: Jeff Hammerbacher [mailto:hammer@cloudera.com]
Sent: Wednesday, February 02, 2011 3:31 PM
To: hdfs-user@hadoop.apache.org
Subject: Re: HDFS without Hadoop: Why?


  • Large block size wastes space for small files. The minimum file size is 1 block.

That's incorrect. If a file is smaller than the block size, it will only consume as much space as there is data in the file.

  • There are no hardlinks, softlinks, or quotas.

That's incorrect; there are quotas and softlinks.
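
(For reference, a hedged sketch of both features: HDFS symlinks are exposed through the FileContext API in releases that include symlink support, and quotas are normally set administratively from the dfsadmin command line. The paths below are made up:)

import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;

public class SymlinkAndQuotaExample {
    public static void main(String[] args) throws Exception {
        // Create an HDFS symlink /data/current -> /data/2011-02-02
        // (hypothetical paths; requires a Hadoop version with symlink support).
        FileContext fc = FileContext.getFileContext();
        fc.createSymlink(new Path("/data/2011-02-02"), new Path("/data/current"), false);

        // Quotas are set from the shell rather than this API, e.g.:
        //   hadoop dfsadmin -setQuota 1000000 /data       (namespace quota: files + dirs)
        //   hadoop dfsadmin -setSpaceQuota 10t /data      (disk space quota)
    }
}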





--
Connect to me at http://www.facebook.com/dhruba

