hadoop-common-user mailing list archives

From Panshul Whisper <ouchwhis...@gmail.com>
Subject Re: Estimating disk space requirements
Date Fri, 18 Jan 2013 23:34:28 GMT
Ah, now I understand what you mean.
I will be creating 20 individual servers on the cloud, rather than creating
one big server with several virtual nodes inside it.
I will be paying for 20 different nodes, all configured with Hadoop and
connected to the cluster.

Thanks for the intel :)


On Fri, Jan 18, 2013 at 11:59 PM, Ted Dunning <tdunning@maprtech.com> wrote:

> It is usually better not to subdivide nodes into virtual nodes.  You will
> generally get better performance from the original node because you only
> pay for the OS once and because your disk I/O will be scheduled better.
>
> If you look at EC2 pricing, however, the spot market often has arbitrage
> opportunities where one size node is absurdly cheap relative to others.  In
> that case, it pays to scale the individual nodes up or down.
>
> The only reasonable case for splitting nodes into very small pieces is
> testing and training.
>
>
> On Fri, Jan 18, 2013 at 2:30 PM, Panshul Whisper <ouchwhisper@gmail.com> wrote:
>
>> Thanks for the reply, Ted.
>>
>> You can find 40 GB disks when you make virtual nodes on a cloud like
>> Rackspace ;-)
>>
>> About the OS partitions, I did not exactly understand what you meant.
>> I have made a server on the cloud, and I just installed and configured
>> Hadoop and HBase in the /usr/local folder.
>> And I am pretty sure it does not have a separate partition for root.
>>
>> Please explain what you meant and what other precautions I should
>> take.
>>
>> Thanks,
>>
>> Regards,
>> Ouch Whisper
>> 01010101010
>> On Jan 18, 2013 11:11 PM, "Ted Dunning" <tdunning@maprtech.com> wrote:
>>
>>> Where do you find 40 GB disks nowadays?
>>>
>>> Normally your performance is going to be better with more space, but your
>>> network may be your limiting factor for some computations.  That could give
>>> you some paradoxical scaling.  HBase will rarely show this behavior.
>>>
>>> Keep in mind you also want to allow for an OS partition. Current
>>> standard practice is to reserve as much as 100 GB for that partition, but in
>>> your case 10 GB is better. :-)
>>>
>>> Note that if you account for this, the node counts don't scale as
>>> simply.  The overhead of these OS partitions goes up with the number of nodes.
>>>
>>> On Jan 18, 2013, at 8:55 AM, Panshul Whisper <ouchwhisper@gmail.com>
>>> wrote:
>>>
>>> If we look at it with performance in mind,
>>> is it better to have 20 nodes with a 40 GB HDD each,
>>> or is it better to have 10 nodes with an 80 GB HDD each?
>>>
>>> They are connected on a gigabit LAN.
>>>
>>> Thanks
>>>
>>>
>>> On Fri, Jan 18, 2013 at 2:26 PM, Jean-Marc Spaggiari <
>>> jean-marc@spaggiari.org> wrote:
>>>
>>>> 20 nodes with 40 GB will do the job.
>>>>
>>>> After that you will have to consider performance based on your access
>>>> pattern. But that's another story.
>>>>
>>>> JM
>>>>
>>>> 2013/1/18, Panshul Whisper <ouchwhisper@gmail.com>:
>>>> > Thank you for the replies,
>>>> >
>>>> > So I take it that I should have at least 800 GB of total free space on
>>>> > HDFS (combined free space of all the nodes connected to the cluster).
>>>> > So I can connect 20 nodes, each with a 40 GB HDD, to my cluster. Will
>>>> > this be enough for the storage?
>>>> > Please confirm.
>>>> >
>>>> > Thanking You,
>>>> > Regards,
>>>> > Panshul.
>>>> >
>>>> >
>>>> > On Fri, Jan 18, 2013 at 1:36 PM, Jean-Marc Spaggiari <
>>>> > jean-marc@spaggiari.org> wrote:
>>>> >
>>>> >> Hi Panshul,
>>>> >>
>>>> >> If you have 20 GB with a replication factor set to 3, you have only
>>>> >> 6.6 GB available, not 11 GB. You need to divide the total space by
>>>> >> the replication factor.
>>>> >>
>>>> >> Also, if you store your JSON into HBase, you need to add the key size
>>>> >> to it. If your key is 4 bytes, or 1024 bytes, it makes a difference.
>>>> >>
>>>> >> So roughly, 24,000,000 * 5 * 1024 = 114 GB. You don't have the space
>>>> >> to store it, and that is without including the key size. Even with a
>>>> >> replication factor set to 5 you don't have the space.
>>>> >>
>>>> >> Now, you can add some compression, but even with a lucky factor of
>>>> >> 50% you still don't have the space. You will need something like a 90%
>>>> >> compression factor to be able to store this data in your cluster.
>>>> >>
>>>> >> A 1 TB drive is now less than $100, so you might think about replacing
>>>> >> your 20 GB drives with something bigger.
>>>> >> To reply to your last question, for your data here you will need AT
>>>> >> LEAST 350 GB of overall storage. But that's a bare minimum. Don't go
>>>> >> under 500 GB.
>>>> >>
>>>> >> IMHO
>>>> >>
>>>> >> JM
>>>> >>
>>>> >> 2013/1/18, Panshul Whisper <ouchwhisper@gmail.com>:
>>>> >> > Hello,
>>>> >> >
>>>> >> > I was estimating how much disk space I need for my cluster.
>>>> >> >
>>>> >> > I have 24 million JSON documents, approx. 5 KB each.
>>>> >> > The JSON is to be stored in HBase with some identifying data in
>>>> >> > columns, and I also want to store the JSON for later retrieval
>>>> >> > based on the ID data as keys in HBase.
>>>> >> > I have my HDFS replication set to 3.
>>>> >> > Each node has Hadoop, HBase, and Ubuntu installed on it, so
>>>> >> > approx. 11 GB is available for use on my 20 GB node.
>>>> >> >
>>>> >> > I have no idea whether, if I have not enabled HBase replication,
>>>> >> > the HDFS replication is enough to keep the data safe and redundant.
>>>> >> > How much total disk space will I need for the storage of the data?
>>>> >> >
>>>> >> > Please help me estimate this.
>>>> >> >
>>>> >> > Thank you so much.
>>>> >> >
>>>> >> > --
>>>> >> > Regards,
>>>> >> > Ouch Whisper
>>>> >> > 010101010101
>>>> >> >
>>>> >>
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Regards,
>>>> > Ouch Whisper
>>>> > 010101010101
>>>> >
>>>>
>>>
>>>
>>>
>>> --
>>> Regards,
>>> Ouch Whisper
>>> 010101010101
>>>
>>>
>


-- 
Regards,
Ouch Whisper
010101010101
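
A minimal sketch, in Python, of the sizing arithmetic discussed in this
thread. The document counts (24 million JSON documents of roughly 5 KB each,
HDFS replication factor 3) are the figures quoted above; the pessimistic
1 KB row key and the 10 GB OS reserve per node follow the numbers Jean-Marc
and Ted mention, but the script itself is only an illustration, not part of
the original exchange.

# Rough storage estimate for the figures discussed in this thread.
KB = 1024
GB = 1024 ** 3

docs = 24_000_000          # number of JSON documents
doc_size = 5 * KB          # average document size (~5 KB)
key_size = 1 * KB          # pessimistic HBase row key overhead (assumption)
replication = 3            # HDFS replication factor
os_reserve_gb = 10         # space reserved per node for the OS partition

raw = docs * doc_size                      # ~114 GB of raw JSON
with_keys = docs * (doc_size + key_size)   # ~137 GB including row keys
replicated = with_keys * replication       # ~412 GB actually written to HDFS

print(f"raw JSON:         {raw / GB:6.1f} GB")
print(f"with row keys:    {with_keys / GB:6.1f} GB")
print(f"with replication: {replicated / GB:6.1f} GB")

# Usable HDFS capacity for a given cluster layout, after the OS partition.
def usable_gb(nodes, disk_gb_per_node, os_gb=os_reserve_gb):
    return nodes * (disk_gb_per_node - os_gb)

for nodes, disk in [(20, 40), (10, 80)]:
    cap = usable_gb(nodes, disk)
    print(f"{nodes} x {disk} GB nodes: {cap} GB usable, "
          f"{cap / replication:.0f} GB of unique data")

The per-layout figures also show why the OS partitions matter more as the
node count grows: 20 x 40 GB nodes leave 600 GB usable (about 200 GB of
unique data at replication 3), while 10 x 80 GB nodes leave 700 GB usable
(about 233 GB of unique data).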
