hbase-user mailing list archives

From yuzhih...@gmail.com
Subject Re: Unexpected Data insertion time and Data size explosion
Date Sun, 04 Dec 2011 19:51:19 GMT
May I ask whether you pre-split your table before loading?
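
For reference, a table can be pre-split at creation time so the load is
spread across all region servers from the start instead of funneling into
a single region. A minimal sketch through the Java client API (the table
name, family name, and split points below are placeholders; real split
keys should match the distribution of your row IDs):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitExample {
      public static void main(String[] args) throws Exception {
        HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
        // "mytable" and family "f" are placeholder names.
        HTableDescriptor desc = new HTableDescriptor("mytable");
        desc.addFamily(new HColumnDescriptor("f"));
        // Nine split points ("1" .. "9") create ten initial regions; this
        // assumes decimal string row IDs spread evenly over the first digit.
        byte[][] splits = new byte[9][];
        for (int i = 1; i <= 9; i++) {
          splits[i - 1] = Bytes.toBytes(String.valueOf(i));
        }
        admin.createTable(desc, splits);
      }
    }

Without pre-splitting, every map task initially writes to the one region
of a new table, so throughput is gated by region splits rather than by
the cluster.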



On Dec 4, 2011, at 6:19 AM, kranthi reddy <kranthili2020@gmail.com> wrote:

> Hi all,
> 
>    I am a newbie to HBase and Hadoop. I have set up a cluster of 4 machines
> and am trying to insert data. Three of the machines are tasktrackers, with
> 4 map tasks each.
> 
>    My data consists of about 1.3 billion rows with 4 columns each (a 100GB
> txt file). The column structure is "rowID, word1, word2, word3". My DFS
> replication in Hadoop and HBase is set to 3 each. I have used only one
> column family, with one qualifier for each field (word*), i.e. 3 qualifiers.
> 
>    I am using the SampleUploader present in the HBase distribution. It has
> taken around 21 hrs to complete 40% of the insertion, and it is still
> running, with 12 map tasks. Is this insertion time expected? When I used
> Lucene, I was able to insert the entire data set in about 8 hours.
> 
>    Also, there seems to be a huge explosion of data size here. With a
> replication factor of 3 for HBase, I was expecting the inserted table to
> occupy around 350-400GB (300GB for replicating the 100GB txt file 3 times,
> plus 50+GB for additional storage metadata). But at only 40% completion,
> the space occupied is already around 550GB, so it looks like it might take
> around 1.2TB for a 100GB file. I have used a String rowID instead of a
> Long. Would that account for such a rapid increase in data storage?
> 
> Regards,
> Kranthi
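
On the table size: HBase stores the full row key, column family name,
qualifier, and timestamp alongside every cell value, so with three
qualifiers each row's key material is written three times before HDFS
replication multiplies everything by 3 again. A short sketch of the
per-cell overhead using the KeyValue API (the values are made up; only
the sizes matter):

    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CellSizeExample {
      public static void main(String[] args) {
        byte[] row = Bytes.toBytes("1300000000");  // 10-byte String row ID
        KeyValue kv = new KeyValue(row, Bytes.toBytes("f"),
            Bytes.toBytes("word1"), System.currentTimeMillis(),
            Bytes.toBytes("example"));             // 7-byte value
        // Prints the on-disk footprint of this single cell; with the row
        // key, family, qualifier, timestamp, and length fields included it
        // is several times the size of the value itself.
        System.out.println("cell size: " + kv.getLength() + " bytes");
        // The String-vs-Long row ID difference is small by comparison:
        System.out.println(Bytes.toBytes("1300000000").length); // 10 bytes
        System.out.println(Bytes.toBytes(1300000000L).length);  // 8 bytes
      }
    }

So a String row ID costs only a couple of bytes more per key than a Long;
most of the growth comes from the per-cell key repetition, with the
write-ahead logs kept during the load possibly adding to it.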
