From: Gaurav Sharma <gaurav.gs.sharma@gmail.com>
Date: Wed, 2 Feb 2011 21:31:07 -0500
Subject: Re: HDFS without Hadoop: Why?
To: hdfs-user@hadoop.apache.org

Stuart - if Dhruba is giving the per-file and per-block sizes used by the HDFS namenode, you really cannot get a more authoritative number elsewhere :) I would do the back-of-envelope with ~160 bytes/file and ~150 bytes/block.
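
For concreteness, a minimal sketch of that back-of-envelope in Java, assuming only the rough ~160 bytes/file and ~150 bytes/block figures quoted in this thread (real namenode heap usage varies by version and JVM, so treat this as an order-of-magnitude estimate, not a sizing tool):

// Back-of-envelope namenode memory estimate using the rough per-object
// costs quoted in this thread (~160 bytes/file, ~150 bytes/block).
public class NamenodeMemoryEstimate {

    static final long BYTES_PER_FILE = 160;   // rough per-file metadata cost
    static final long BYTES_PER_BLOCK = 150;  // rough per-block metadata cost

    /** Estimated namenode heap (bytes) consumed by file/block metadata alone. */
    static long estimateHeapBytes(long files, long blocks) {
        return files * BYTES_PER_FILE + blocks * BYTES_PER_BLOCK;
    }

    public static void main(String[] args) {
        // The Yahoo example quoted below: 100 million files, 200 million blocks.
        long files = 100000000L;
        long blocks = 200000000L;
        double gb = estimateHeapBytes(files, blocks) / (1024.0 * 1024.0 * 1024.0);
        // Prints roughly 43 GB of raw metadata; the "at least 60 GB" guidance
        // below leaves headroom for everything else on the namenode heap.
        System.out.printf("~%.0f GB of namenode heap for metadata alone%n", gb);
    }
}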

On Wed, Feb 2, 2011 at 9:08 PM, Stuart Smith <stu24mail@yahoo.com> wrote:

This is the best coverage I've seen from a source that would know:

http://developer.yahoo.com/blogs/hadoop/posts/2010/05/scalability_of_the_hadoop_dist/

One relevant quote:

To store 100 million files (referencing 200 million blocks), a name-node should have at least 60 GB of RAM.

But, honestly, if you're just building out your cluster, you'll probably run into a lot of other limits first: hard drive space, regionserver memory, the infamous ulimit/xciever :), etc.

Take care,
  -stu

--- On Wed, 2/2/11, Dhruba Borthakur <dhruba@gmail.com> wrote:

From: Dhruba Borthakur <dhruba@gmail.com>

Subject: Re: HDFS without Hadoop: Why?
Date: Wednesday, February 2, 2011, 9:00 PM

The Namenode uses around 160 bytes/file and 150 bytes/block in HDFS. This is a very rough calculation.

dhruba

On Wed, Feb 2, 2011 at 5:11 PM, Dhodapkar, Chinmay <chinmayd@qualcomm.com> wrote:

What you describe is pretty much my use case as well. Since I don't know how big the number of files could get, I am trying to figure out if there is a theoretical design limitation in hdfs…

From what I have read, the name node will store all metadata of all files in RAM. Assuming (in my case) that a file is less than the configured block size… there should be a very rough formula that can be used to calculate the max number of files that hdfs can serve based on the configured RAM on the name node?

Can any of the implementers comment on this? Am I even thinking on the right track…?

Thanks Ian for the haystack link… very informative indeed.

-Chinmay
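
(As a rough illustration of exactly that kind of formula, using the per-object costs given earlier in the thread; this is a sketch only, since a real namenode heap also needs room for everything else:)

// Rough upper bound on file count for a given metadata heap budget, assuming
// ~160 bytes per file plus ~150 bytes per block (the thread's rough figures).
public class MaxFilesEstimate {
    static long roughMaxFiles(long heapBytesForMetadata, double avgBlocksPerFile) {
        double bytesPerFile = 160 + 150 * avgBlocksPerFile;
        return (long) (heapBytesForMetadata / bytesPerFile);
    }

    public static void main(String[] args) {
        // Example: files smaller than one block (1 block/file) and, say, 30 GB
        // of heap budgeted for metadata -> on the order of 100 million files.
        long thirtyGb = 30L * 1024 * 1024 * 1024;
        System.out.println(roughMaxFiles(thirtyGb, 1.0));  // ~100 million
    }
}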


From: Stuart Smith [mailto:stu24mail@yahoo.com]
Sent: Wednesday, February 02, 2011 4:41 PM

Subject: RE: HDFS without Hadoop: Why?


Hello,
I'm actually using hbase/hadoop/hdfs for lots of small files (with a long tail of larger files). Well, millions of small files - I don't know what you mean by lots :)

Facebook probably knows better, but what I do is:

  - store metadata in hbase
  - files smaller than 10 MB or so in hbase
  - larger files in an hdfs directory tree.

I started storing 64 MB files and smaller in hbase (chunk size), but that causes issues with regionservers when running M/R jobs. This is related to the fact that I'm running a cobbled-together cluster & my region servers don't have that much memory. I would play with the size to see what works for you.
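
(A minimal sketch of that size-threshold routing, using the old-style HBase client API; the table name "files", the column families, the /bigfiles layout, and the 10 MB cutoff are made-up examples, and error handling is omitted:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

/** Sketch: small payloads go into an HBase cell next to their metadata;
 *  large payloads go to an HDFS file whose path is recorded in HBase. */
public class SmallFileRouter {
    // Hypothetical threshold and schema -- tune to your regionserver memory.
    static final int SMALL_FILE_LIMIT = 10 * 1024 * 1024;  // ~10 MB
    static final byte[] META = Bytes.toBytes("meta");       // metadata family
    static final byte[] DATA = Bytes.toBytes("data");       // inline-content family

    public static void store(String key, byte[] content) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "files");
        Put put = new Put(Bytes.toBytes(key));
        put.add(META, Bytes.toBytes("size"), Bytes.toBytes((long) content.length));

        if (content.length <= SMALL_FILE_LIMIT) {
            // Small file: keep the bytes in HBase alongside the metadata.
            put.add(DATA, Bytes.toBytes("content"), content);
        } else {
            // Large file: write it into an HDFS directory tree, store only the path.
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/bigfiles/" + key);
            FSDataOutputStream out = fs.create(path);
            out.write(content);
            out.close();
            put.add(META, Bytes.toBytes("hdfsPath"), Bytes.toBytes(path.toString()));
        }
        table.put(put);
        table.close();
    }
}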

Take care,
  -stu

--- On Wed, 2/2/11, Dhodapkar, Chinmay <chinmayd@qualcomm.com> wrote:


From: Dhodapkar, Chinmay <chinmayd@qualcomm.com>
Subject: RE: HDFS without Hadoop: Why?
To: "hdfs-user@hadoop.apache.org" <hdfs-user@hadoop.apache.org>
Date: Wednesday, February 2, 2011, 7:28 PM

Hello,

I have been following this thread for some time now. I am very comfortable with the advantages of hdfs, but still have lingering questions about the usage of hdfs for general purpose storage (no mapreduce/hbase etc).

Can somebody shed light on what the limitations are on the number of files that can be stored. Is it limited in any way by the namenode? The use case I am interested in is to store a very large number of relatively small files (1 MB to 25 MB).

Interestingly, I saw a Facebook presentation on how they use hbase/hdfs internally. They seem to store all metadata in hbase and the actual images/files/etc in something called "haystack" (why not use hdfs since they already have it?). Anybody know what "haystack" is?

Thanks!

Chinmay


From: Jeff Hammerbacher [mailto:hammer@cloudera.com]
Sent: Wednesday, February 02, 2011 3:31 PM
To: hdfs-user@hadoop.apache.org
Subject: Re: HDFS without Hadoop: Why?


  • Large block size wastes space for small files. The minimum file size is 1 block.

That's incorrect. If a file is smaller than the block size, it will only consume as much space as there is data in the file.

  • There are no hardlinks, softlinks, or quotas.

That's incorrect; there are quotas and softlinks.
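
(For reference, a hedged sketch of both features: HDFS symlinks are exposed through the FileContext API in releases that include symlink support, and quotas are normally set administratively from the dfsadmin command line. The paths below are made up:)

import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;

public class SymlinkAndQuotaExample {
    public static void main(String[] args) throws Exception {
        // Create an HDFS symlink /data/current -> /data/2011-02-02
        // (hypothetical paths; requires a Hadoop version with symlink support).
        FileContext fc = FileContext.getFileContext();
        fc.createSymlink(new Path("/data/2011-02-02"), new Path("/data/current"), false);

        // Quotas are set from the shell rather than this API, e.g.:
        //   hadoop dfsadmin -setQuota 1000000 /data       (namespace quota: files + dirs)
        //   hadoop dfsadmin -setSpaceQuota 10t /data      (disk space quota)
    }
}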





--
Connect to me at http://www.facebook.com/dhruba

