From: Mohammad Tariq <dontariq@gmail.com>
Date: Sat, 19 Jan 2013 04:05:25 +0530
Subject: Re: Estimating disk space requirements
To: user@hadoop.apache.org

You can attach a separate disk to your instance (for example an EBS volume in the case of AWS), where you will store only Hadoop-related stuff, and one disk for OS-related stuff.

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com

On Sat, Jan 19, 2013 at 4:00 AM, Panshul Whisper <ouchwhisper@gmail.com> wrote:

> Thanks for the reply Ted,
>
> You can find 40 GB disks when you make virtual nodes on a cloud like
> Rackspace ;-)
>
> About the OS partitions, I did not exactly understand what you meant.
> I have made a server on the cloud, and I just installed and configured
> Hadoop and HBase in the /usr/local folder.
> And I am pretty sure it does not have a separate partition for root.
>
> Please help me understand what you meant and what other precautions I
> should take.
>
> Thanks,
> Regards,
> Ouch Whisper
> 01010101010
>
> On Jan 18, 2013 11:11 PM, "Ted Dunning" <tdunning@maprtech.com> wrote:
>
>> Where do you find 40 GB disks nowadays?
>>
>> Normally your performance is going to be better with more space, but your
>> network may be your limiting factor for some computations. That could give
>> you some paradoxical scaling. HBase will rarely show this behavior.
>>
>> Keep in mind you also want to allow for an OS partition.
>> Current standard practice is to reserve as much as 100 GB for that
>> partition, but in your case 10 GB is better :-)
>>
>> Note that if you account for this, the node counts don't scale as simply.
>> The overhead of these OS partitions goes up with the number of nodes.
>>
>> On Jan 18, 2013, at 8:55 AM, Panshul Whisper <ouchwhisper@gmail.com> wrote:
>>
>> If we look at it with performance in mind,
>> is it better to have 20 nodes with 40 GB HDD,
>> or is it better to have 10 nodes with 80 GB HDD?
>>
>> They are connected on a gigabit LAN.
>>
>> Thanks
>>
>> On Fri, Jan 18, 2013 at 2:26 PM, Jean-Marc Spaggiari <
>> jean-marc@spaggiari.org> wrote:
>>
>>> 20 nodes with 40 GB will do the work.
>>>
>>> After that you will have to consider performance based on your access
>>> pattern. But that's another story.
>>>
>>> JM
>>>
>>> 2013/1/18, Panshul Whisper <ouchwhisper@gmail.com>:
>>> > Thank you for the replies,
>>> >
>>> > So I take it that I should have at least 800 GB of total free space on
>>> > HDFS (combined free space of all the nodes connected to the cluster). So
>>> > I can connect 20 nodes having a 40 GB HDD on each node to my cluster.
>>> > Will this be enough for the storage?
>>> > Please confirm.
>>> >
>>> > Thanking you,
>>> > Regards,
>>> > Panshul.
>>> >
>>> > On Fri, Jan 18, 2013 at 1:36 PM, Jean-Marc Spaggiari <
>>> > jean-marc@spaggiari.org> wrote:
>>> >
>>> >> Hi Panshul,
>>> >>
>>> >> If you have 20 GB with a replication factor set to 3, you have only
>>> >> 6.6 GB available, not 11 GB. You need to divide the total space by the
>>> >> replication factor.
>>> >>
>>> >> Also, if you store your JSON into HBase, you need to add the key size
>>> >> to it. If your key is 4 bytes, or 1024 bytes, it makes a difference.
>>> >>
>>> >> So roughly, 24,000,000 * 5 * 1024 = 114 GB. You don't have the space
>>> >> to store it, and that is without including the key size. Even with a
>>> >> replication factor set to 5 you don't have the space.
>>> >>
>>> >> Now, you can add some compression, but even with a lucky factor of 50%
>>> >> you still don't have the space. You will need something like a 90%
>>> >> compression factor to be able to store this data in your cluster.
>>> >>
>>> >> A 1 TB drive is now less than $100... so you might think about
>>> >> replacing your 20 GB drives with something bigger.
>>> >> To reply to your last question, for your data here you will need AT
>>> >> LEAST 350 GB of overall storage. But that's a bare minimum. Don't go
>>> >> under 500 GB.
>>> >>
>>> >> IMHO
>>> >>
>>> >> JM
>>> >>
>>> >> 2013/1/18, Panshul Whisper <ouchwhisper@gmail.com>:
>>> >> > Hello,
>>> >> >
>>> >> > I was estimating how much disk space I need for my cluster.
>>> >> >
>>> >> > I have 24 million JSON documents, approx. 5 KB each.
>>> >> > The JSON is to be stored into HBase with some identifying data in
>>> >> > columns, and I also want to store the JSON for later retrieval
>>> >> > based on the ID data as keys in HBase.
>>> >> > I have my HDFS replication set to 3.
>>> >> > Each node has Hadoop, HBase, and Ubuntu installed on it, so approx.
>>> >> > 11 GB is available for use on my 20 GB node.
>>> >> >
>>> >> > I have no idea, if I have not enabled HBase replication, whether
>>> >> > the HDFS replication is enough to keep the data safe and redundant,
>>> >> > or how much total disk space I will need for the storage of the data.
>>> >> >
>>> >> > Please help me estimate this.
>>> >> >
>>> >> > Thank you so much.
>>> >> >
>>> >> > --
>>> >> > Regards,
>>> >> > Ouch Whisper
>>> >> > 010101010101
>>> >
>>> > --
>>> > Regards,
>>> > Ouch Whisper
>>> > 010101010101
>>
>> --
>> Regards,
>> Ouch Whisper
>> 010101010101
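Jean-Marc's sizing arithmetic can be checked with a short back-of-the-envelope script. This is only a sketch of the estimate discussed in the thread: the 5 KB document size, replication factor 3, 20 GB node size, and compression ratios are the thread's own assumptions, not measurements, and real HDFS/HBase overhead (keys, block padding, WALs) would push the numbers higher.

```python
# Back-of-the-envelope HDFS sizing, following the numbers in the thread.
docs = 24_000_000       # JSON documents
doc_kb = 5              # approx. size of each document, in KB
replication = 3         # HDFS replication factor
node_gb = 20            # raw disk per node in the original question

raw_gb = docs * doc_kb / (1024 * 1024)   # raw payload in GB (~114.4)
with_repl = raw_gb * replication         # on-disk after replication (~343.3)

print(f"raw data:          {raw_gb:.1f} GB")
print(f"with replication:  {with_repl:.1f} GB")

# Usable space per node: divide raw capacity by the replication factor.
print(f"usable per node:   {node_gb / replication:.1f} GB")

# Compression reduces the payload before replication; even at 50% the
# replicated footprint is still far beyond a 20-nodes-of-20-GB cluster.
for keep in (0.5, 0.1):
    print(f"at {keep:.0%} of original size: {with_repl * keep:.1f} GB on disk")
```

This also shows where the "at least 800 GB" figure in the thread comes from: ~343 GB of replicated data needs comfortable headroom for OS partitions, temporary MapReduce output, and HBase compactions.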