hadoop-mapreduce-user mailing list archives

From Mirko Kämpf <mirko.kae...@gmail.com>
Subject Re: Estimating disk space requirements
Date Fri, 18 Jan 2013 12:44:01 GMT

Some comments are inline in your message ...

2013/1/18 Panshul Whisper <ouchwhisper@gmail.com>

> Hello,
> I am estimating how much disk space I need for my cluster.
> I have 24 million JSON documents of approx. 5 KB each.
> The JSON is to be stored in HBase with some identifying data in columns,
> and I also want to store the JSON for later retrieval, using the ID data
> as keys in HBase.
> I have my HDFS replication set to 3.
> Each node has Hadoop, HBase and Ubuntu installed on it, so approx. 11
> GB is available for use on my 20 GB node.

11 GB is quite small  - or is there a typo?

The amount of raw data is about 115 GB (without additional key and metadata):

   nr of items      size of an item   Bytes          GB
   24 x 1.00E+006   5 x 1.02E+003     122880000000   114.4409179688

Depending on the amount of overhead, this could grow to about 200 GB. With a
replication factor of 3, that is 600 GB just for distributed storage.

You also need some capacity to store intermediate processing data; 20% to
30% of the processed data is recommended.

So you should plan for a capacity of 1 TB, or even more if your dataset grows.
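
The estimate above can be written down as a small calculation, so you can
re-run it when the numbers change. This is only a rough sketch: the function
name and parameters are made up for illustration, the overhead factor of 1.75
is my guess to get from about 115 GB to about 200 GB, and I assume the 30%
for intermediate data is taken on top of the replicated size.

```python
GIB = 1024 ** 3  # the "GB" in the table above is 1024^3 bytes


def estimate_capacity(num_docs, doc_size_bytes, overhead_factor,
                      replication, scratch_fraction):
    """Return (raw, replicated, total) sizes in bytes.

    overhead_factor and scratch_fraction are assumptions, not measured values.
    """
    raw = num_docs * doc_size_bytes          # documents only, no keys/metadata
    stored = raw * overhead_factor           # plus keys, metadata, HBase overhead
    replicated = stored * replication        # HDFS keeps `replication` copies
    total = replicated * (1 + scratch_fraction)  # room for intermediate data
    return raw, replicated, total


raw, replicated, total = estimate_capacity(
    num_docs=24_000_000,
    doc_size_bytes=5 * 1024,   # approx. 5 KB per JSON document
    overhead_factor=1.75,      # assumed: ~115 GB raw -> ~200 GB stored
    replication=3,             # HDFS replication from the original mail
    scratch_fraction=0.30,     # upper end of the 20% to 30% range
)

print(f"raw:        {raw / GIB:.1f} GB")
print(f"replicated: {replicated / GIB:.1f} GB")
print(f"total:      {total / GIB:.1f} GB")
```

With these assumptions the total stays below 1 TB, which leaves some room for
growth. Measure the real overhead on a sample of your data before you rely on
the factor.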


> I have not enabled HBase replication. Is the HDFS
> replication enough to keep the data safe and redundant?

Replication at the HDFS level is sufficient for keeping the data safe;
there is no need to replicate the HBase tables separately.

> How much total disk space will I need for the storage of the data?

> Please help me estimate this.
> Thank you so much.
> --
> Regards,
> Ouch Whisper
> 010101010101

Best wishes
