hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From kranthi reddy <kranthili2...@gmail.com>
Subject Unexpected Data insertion time and Data size explosion
Date Sun, 04 Dec 2011 14:19:33 GMT
Hi all,

    I am a newbie to Hbase and Hadoop. I have setup a cluster of 4 machines
and am trying to insert data. 3 of the machines are tasktrackers, with 4
map tasks each.

    My data consists of about 1.3 billion rows with 4 columns each (100GB
txt file). The column structure is "rowID, word1, word2, word3".  My DFS
replication in hadoop and hbase is set to 3 each. I have put only one
column family and 3 qualifiers for each field (word*).

    I am using the SampleUploader present in the HBase distribution. To
complete 40% of the insertion, it has taken around 21 hrs and it's still
running. I have 12 map tasks running.* I would like to know is the
insertion time taken here on expected lines ??? Because when I used lucene,
I was able to insert the entire data in about 8 hours.*

    Also, there seems to be huge explosion of data size here. With a
replication factor of 3 for HBase, I was expecting the table size inserted
to be around 350-400GB. (350-400GB for an 100GB txt file I have, 300GB for
replicating the data 3 times and 50+ GB for additional storage
information). But even for 40% completion of data insertion, the space
occupied is around 550GB (Looks like it might take around 1.2TB for an
100GB file).* I have used the rowID to be a String, instead of Long. Will
that account for such rapid increase in data storage???


  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message