hbase-user mailing list archives

From "Liu, Ming (Ming)" <ming....@esgyn.cn>
Subject Re: what is a good way to bulkload large amount of data into HBase table
Date Sun, 07 Feb 2016 03:03:39 GMT

It turns out this was my own mistake. I created the 'store' table incorrectly, so it was not
evenly partitioned. (The first 99 ranges all fell within the first 1M, and everything else
went to the last range.) I have now created a new 'store' table with the correct presplit
ranges. Importtsv launched 100 reducers and took about 14 mins to finish the loading.
This speed is good enough for me. If the data volume is really big, one can presplit the
target table into more regions, add more nodes to the cluster, and get a scalable loading
speed. I cannot prove this by testing because of hardware limitations, but I think it is a
reasonable expectation.
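
The uneven split described above can be avoided by computing the split points before
creating the table. Here is a minimal sketch, assuming row keys are zero-padded decimal
numbers spread uniformly over a known range. The key format, the key range, and the column
family name 'cf' are illustrative assumptions, not details from this thread.

```python
def even_split_points(max_key: int, regions: int, width: int) -> list[str]:
    """Return regions-1 zero-padded split keys that divide [0, max_key) evenly."""
    if regions < 2:
        return []
    # Zero-padding keeps lexicographic order identical to numeric order,
    # which matters because HBase sorts row keys as bytes.
    return [str(max_key * i // regions).zfill(width) for i in range(1, regions)]


def create_statement(table: str, family: str, splits: list[str]) -> str:
    """Format an HBase shell `create` command with explicit SPLITS."""
    quoted = ", ".join(f"'{s}'" for s in splits)
    return f"create '{table}', '{family}', SPLITS => [{quoted}]"


# Hypothetical key space: 100M numeric keys padded to 9 digits, 100 regions.
splits = even_split_points(max_key=100_000_000, regions=100, width=9)
print(len(splits))  # 99 split points give 100 regions
print(create_statement("store", "cf", splits[:3]))
```

The printed statement can be pasted into the HBase shell. If the real keys are not uniformly
distributed, the split points should be taken from a sample of the data instead.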

For now, loading 135G of raw data in 14 mins is not bad for me: about 600G/hr on a 10-node
cluster. Not very good, but acceptable, and I am counting on the scalability of HBase and
MapReduce.
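
The figures in this thread can be checked with a few lines of arithmetic. This is only a
rough sketch. The 50T extrapolation assumes throughput stays constant on the same cluster
and that 1T = 1024G; both are simplifying assumptions, not measurements.

```python
def gb_per_hour(size_gb: float, minutes: float) -> float:
    """Convert a load of size_gb finished in `minutes` to G/hr."""
    return size_gb * 60.0 / minutes


def hours_to_load(size_gb: float, rate_gb_per_hour: float) -> float:
    """Time to load size_gb at a constant rate."""
    return size_gb / rate_gb_per_hour


before = gb_per_hour(135, 40)  # badly split table: 202.5 G/hr
after = gb_per_hour(135, 14)   # evenly presplit table: roughly 579 G/hr

# Naive linear extrapolation to 50T on the same 10-node cluster.
hours_50t = hours_to_load(50 * 1024, after)

print(f"before: {before:.1f} G/hr, after: {after:.1f} G/hr")
print(f"50T at the presplit rate: {hours_50t:.1f} hours")
```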

Thanks Ted for sharing the info.

From: Ted Yu [mailto:yuzhihong@gmail.com]
Sent: February 6, 2016 21:47
To: user@hbase.apache.org
Subject: Re: what is a good way to bulkload large amount of data into HBase table

Can you describe how you used importtsv?
Here is one related command line parameter:

      "By default importtsv will load data directly into HBase. To instead generate\n" +
      "HFiles of data to prepare for a bulk data load, pass the option:\n" +
      "  -D" + BULK_OUTPUT_CONF_KEY + "=/path/for/output\n" +
      "  Note: if you do not use this option, then the target table must already exist in HBase\n" +

See also http://hbase.apache.org/book.html#arch.bulk.load.complete
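
The two-step bulk load that this option enables (generate HFiles, then hand them to the
region servers) can be sketched as a pair of command lines. The sketch below only builds
and prints the commands; it does not run them. It assumes BULK_OUTPUT_CONF_KEY resolves to
importtsv.bulk.output and that the HBase 1.x class names apply (in later releases
LoadIncrementalHFiles lives in a different package). The table name, column mapping, and
paths are hypothetical.

```python
import shlex


def importtsv_bulk_cmd(table, columns, input_dir, hfile_dir, separator=None):
    """Build the ImportTsv command that writes HFiles instead of issuing Puts."""
    cmd = [
        "hbase", "org.apache.hadoop.hbase.mapreduce.ImportTsv",
        f"-Dimporttsv.columns={columns}",
        f"-Dimporttsv.bulk.output={hfile_dir}",
    ]
    if separator is not None:
        # The default separator is a tab; a CSV file needs a comma.
        cmd.append(f"-Dimporttsv.separator={separator}")
    # -D options must come before the positional arguments.
    return cmd + [table, input_dir]


def completebulkload_cmd(hfile_dir, table):
    """Build the command that moves the generated HFiles into the table."""
    return ["hbase", "org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles",
            hfile_dir, table]


# Hypothetical table, column mapping, and HDFS paths.
step1 = importtsv_bulk_cmd("store", "HBASE_ROW_KEY,cf:c1,cf:c2",
                           "/data/store_csv", "/tmp/store_hfiles", separator=",")
step2 = completebulkload_cmd("/tmp/store_hfiles", "store")

print(shlex.join(step1))
print(shlex.join(step2))
```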


On Sat, Feb 6, 2016 at 12:29 AM, Liu, Ming (Ming) <ming.liu@esgyn.cn> wrote:

> Hello,
> I am trying to find a good way to import a large amount of data into
> HBase from HDFS. I have a CSV file of about 135G. I put it into HDFS
> and then used HBase's importtsv utility to do a bulk load. For that
> 135G of original data, it took 40 mins. I have 10 nodes, each with
> 128G, all disks are SSD, and the network is 10G. In my humble opinion
> this speed is not very good, since it took me only 10 mins to put
> that 135G of data into HDFS.
> I assume Hive will be much faster; for an external table, it takes
> no time at all to load. I will test it later.
> So I want to ask whether anyone has better ideas for doing a bulk
> load in HBase. Or is importtsv already the best tool for bulk loading
> in the HBase world?
> If I have really big data (say > 50T), this does not seem to be a
> practical loading speed, does it? Or is it? In practice, how do people
> normally load data into HBase?
> Thanks in advance,
> Ming