From: Stuart Smith <stu24mail@yahoo.com>
Subject: Re: HDFS without Hadoop: Why?
To: hdfs-user@hadoop.apache.org
Date: Wed, 2 Feb 2011 19:32:12 -0800 (PST)

> Stuart - if Dhruba is giving hdfs file and block sizes used by the namenode, you really cannot get a more authoritative number elsewhere :)

Yes - very true! :) I spaced out on the name there ... ;)

One more thing - I believe that if you're storing a lot of your smaller files in hbase, you'll end up with a lot fewer files on hdfs, since several of your smaller files will end up in one HFile?

I'm storing 5-7 million files, with at least 70-80% ending up in hbase. I only have 16 GB of RAM for my name-node, and it's very far from overloading the memory. Off the top of my head, I think it's << 8 GB of RAM used...
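For a rough sanity check, the back-of-envelope arithmetic looks something like the little Java sketch below. The per-file/per-block byte counts are the ballpark figures quoted further down in this thread, and it assumes most files fit in a single block (true for my small files) - so treat the output as a rough estimate, nothing more:

    public class NamenodeMemoryEstimate {
        // Ballpark per-object costs from this thread (rough figures, not exact).
        static final long BYTES_PER_FILE  = 160L;
        static final long BYTES_PER_BLOCK = 150L;

        public static void main(String[] args) {
            long files  = 7000000L;  // my case: ~5-7 million files
            long blocks = 7000000L;  // assuming most files fit in one block

            long heapBytes = files * BYTES_PER_FILE + blocks * BYTES_PER_BLOCK;
            System.out.printf("~%.1f GB of namenode heap for %d files / %d blocks%n",
                    heapBytes / (1024.0 * 1024 * 1024), files, blocks);
            // ~7M files works out to roughly 2 GB of metadata, which matches what
            // I see. The 100M files / 200M blocks case from the Yahoo! post comes
            // to ~43 GB by this arithmetic, so their "at least 60 GB" figure leaves
            // headroom for everything else the namenode keeps in heap.
        }
    }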
Take care,
  -stu

--- On Wed, 2/2/11, Gaurav Sharma <gaurav.gs.sharma@gmail.com> wrote:

From: Gaurav Sharma <gaurav.gs.sharma@gmail.com>
Subject: Re: HDFS without Hadoop: Why?
To: hdfs-user@hadoop.apache.org
Date: Wednesday, February 2, 2011, 9:31 PM

Stuart - if Dhruba is giving hdfs file and block sizes used by the namenode, you really cannot get a more authoritative number elsewhere :) I would do the back-of-envelope with ~160 bytes/file and ~150 bytes/block.

On Wed, Feb 2, 2011 at 9:08 PM, Stuart Smith <stu24mail@yahoo.com> wrote:

This is the best coverage I've seen from a source that would know:

http://developer.yahoo.com/blogs/hadoop/posts/2010/05/scalability_of_the_hadoop_dist/

One relevant quote:

To store 100 million files (referencing 200 million blocks), a name-node should have at least 60 GB of RAM.

But, honestly, if you're just building out your cluster, you'll probably run into a lot of other limits first: hard drive space, regionserver memory, the infamous ulimit/xciever :), etc...

Take care,
  -stu

--- On Wed, 2/2/11, Dhruba Borthakur <dhruba@gmail.com> wrote:

From: Dhruba Borthakur <dhruba@gmail.com>
Subject: Re: HDFS without Hadoop: Why?
To: hdfs-user@hadoop.apache.org
Date: Wednesday, February 2, 2011, 9:00 PM

The Namenode uses around 160 bytes/file and 150 bytes/block in HDFS. This is a very rough calculation.

dhruba

On Wed, Feb 2, 2011 at 5:11 PM, Dhodapkar, Chinmay <chinmayd@qualcomm.com> wrote:

What you describe is pretty much my use case as well. Since I don't know how big the number of files could get, I am trying to figure out if there is a theoretical design limitation in hdfs...

From what I have read, the name node will store all metadata of all files in the RAM. Assuming (in my case) that a file is less than the configured block size, there should be a very rough formula that can be used to calculate the max number of files that hdfs can serve based on the configured RAM on the name node?

Can any of the implementers comment on this? Am I even thinking on the right track...?

Thanks Ian for the haystack link - very informative indeed.

-Chinmay

From: Stuart Smith [mailto:stu24mail@yahoo.com]
Sent: Wednesday, February 02, 2011 4:41 PM
To: hdfs-user@hadoop.apache.org
Subject: RE: HDFS without Hadoop: Why?

Hello,
   I'm actually using hbase/hadoop/hdfs for lots of small files (with a long tail of larger files). Well, millions of small files - I don't know what you mean by lots :)

Facebook probably knows better, but what I do is:

  - store metadata in hbase
  - files smaller than 10 MB or so in hbase
  - larger files in an hdfs directory tree.

I started storing 64 MB files and smaller in hbase (chunk size), but that causes issues with regionservers when running M/R jobs. This is related to the fact that I'm running a cobbled-together cluster & my region servers don't have that much memory. I would play with the size to see what works for you.
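The store-path split itself is nothing fancy - roughly the shape sketched below (a simplified illustration, not the actual code from this cluster: the table name, column families, and the /store path are made up, and error handling is left out):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SmallFileStore {
        // Files at or below this size go into an HBase cell; bigger ones go to HDFS.
        private static final long HBASE_THRESHOLD = 10L * 1024 * 1024; // ~10 MB

        private final Configuration conf = HBaseConfiguration.create();
        private final HTable metaTable;  // one row per file: metadata + (maybe) content

        public SmallFileStore() throws Exception {
            metaTable = new HTable(conf, "files");  // table/column names are illustrative
        }

        public void store(String name, byte[] content) throws Exception {
            Put put = new Put(Bytes.toBytes(name));
            put.add(Bytes.toBytes("meta"), Bytes.toBytes("size"),
                    Bytes.toBytes(Long.toString(content.length)));

            if (content.length <= HBASE_THRESHOLD) {
                // Small file: keep the bytes in HBase, so they end up packed into HFiles.
                put.add(Bytes.toBytes("data"), Bytes.toBytes("content"), content);
            } else {
                // Large file: write it into an HDFS directory tree and record its path.
                Path path = new Path("/store/" + name);
                FileSystem fs = FileSystem.get(conf);
                FSDataOutputStream out = fs.create(path);
                try { out.write(content); } finally { out.close(); }
                put.add(Bytes.toBytes("meta"), Bytes.toBytes("hdfsPath"),
                        Bytes.toBytes(path.toString()));
            }
            metaTable.put(put);
        }
    }

Reads go the other way around: look up the metadata row first, then either pull the content cell or open the hdfs path it points at.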
Take care,
   -stu

--- On Wed, 2/2/11, Dhodapkar, Chinmay <chinmayd@qualcomm.com> wrote:

From: Dhodapkar, Chinmay <chinmayd@qualcomm.com>
Subject: RE: HDFS without Hadoop: Why?
To: "hdfs-user@hadoop.apache.org" <hdfs-user@hadoop.apache.org>
Date: Wednesday, February 2, 2011, 7:28 PM

Hello,

I have been following this thread for some time now. I am very comfortable with the advantages of hdfs, but still have lingering questions about the usage of hdfs for general purpose storage (no mapreduce/hbase etc).

Can somebody shed light on what the limitations are on the number of files that can be stored? Is it limited in any way by the namenode? The use case I am interested in is to store a very large number of relatively small files (1 MB to 25 MB).

Interestingly, I saw a facebook presentation on how they use hbase/hdfs internally. They seem to store all metadata in hbase and the actual images/files/etc in something called "haystack" (why not use hdfs since they already have it?). Anybody know what "haystack" is?

Thanks!
Chinmay

From: Jeff Hammerbacher [mailto:hammer@cloudera.com]
Sent: Wednesday, February 02, 2011 3:31 PM
To: hdfs-user@hadoop.apache.org
Subject: Re: HDFS without Hadoop: Why?

> Large block size wastes space for small files. The minimum file size is 1 block.

That's incorrect. If a file is smaller than the block size, it will only consume as much space as there is data in the file.

> There are no hardlinks, softlinks, or quotas.

That's incorrect; there are quotas and softlinks.

--
Connect to me at http://www.facebook.com/dhruba
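One more note on the block-size point above: a small file only takes up roughly its own length (times replication) on the datanodes; the configured block size is just an upper bound on how big a single block can get. If you want to see what the namenode reports for a given file, something like this works (sketch only - the path is made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeCheck {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path p = new Path("/store/some-small-file.bin"); // made-up path

            FileStatus st = fs.getFileStatus(p);
            // getLen() is the actual data length; getBlockSize() is just the
            // per-file block size setting (e.g. 64 MB), not space consumed.
            System.out.println("length      : " + st.getLen());
            System.out.println("block size  : " + st.getBlockSize());
            System.out.println("replication : " + st.getReplication());
            // Raw disk usage is roughly length * replication, even when the
            // length is far below the block size.
        }
    }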
