hbase-user mailing list archives

From "Liu, Ming (Ming)" <ming....@esgyn.cn>
Subject Re: what is a good way to bulkload large amount of data into HBase table
Date Sat, 06 Feb 2016 15:08:52 GMT
Thanks Ted for the help,

I used the -Dimporttsv.bulk.output=output option. The table is a test table with 100 columns,
and I randomly generated 50M rows of data. The first column is used as the rowkey; it is a
unique sequence number.
The 50M rows are saved as a csv file, which is 135G in size. Putting it into HDFS took
about 10 mins.

Then I used the following command to invoke importtsv:

hadoop jar /usr/lib/hbase/lib/hbase-server-1.0.0-cdh5.4.4.jar importtsv '-Dimporttsv.separator=|'
-Dimporttsv.bulk.output=output -Dimporttsv.columns=HBASE_ROW_KEY,f:c2,f:c3,f:c4,f:c5,f:c6,f:c7,f:c8,f:c9,f:c10,f:c11,f:c12,f:c13,f:c14,f:c15,f:c16,f:c17,f:c18,f:c19,f:c20,f:c21,f:c22,f:c23,f:c24,f:c25,f:c26,f:c27,f:c28,f:c29,f:c30,f:c31,f:c32,f:c33,f:c34,f:c35,f:c36,f:c37,f:c38,f:c39,f:c40,f:c41,f:c42,f:c43,f:c44,f:c45,f:c46,f:c47,f:c48,f:c49,f:c50,f:c51,f:c52,f:c53,f:c54,f:c55,f:c56,f:c57,f:c58,f:c59,f:c60,f:c61,f:c62,f:c63,f:c64,f:c65,f:c66,f:c67,f:c68,f:c69,f:c70,f:c71,f:c72,f:c73,f:c74,f:c75,f:c76,f:c77,f:c78,f:c79,f:c80,f:c81,f:c82,f:c83,f:c84,f:c85,f:c86,f:c87,f:c88,f:c89,f:c90,f:c91,f:c92,f:c93,f:c94,f:c95,f:c96,f:c97,f:c98,f:c99,f:c100
store /bulkload/perftest/bltest1.csv

It took about 40 mins. Then I changed the owner of the HDFS dir 'output' to hbase:hbase and
invoked the complete-bulkload step like this:
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles output store
This took about 1 second, very fast. So the total loading time is the 40 mins spent in the first step.
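For a rough sense of the rate, 135G in 40 minutes works out to under 60 MB/s aggregate. A quick awk sketch of the arithmetic (the size and time come from this run; the per-node figure assumes the load was spread evenly across the 10 nodes):

```shell
# Arithmetic only: 135 GB loaded in 40 min on a 10-node cluster.
awk 'BEGIN {
  size_mb = 135 * 1024          # input size in MB
  secs    = 40 * 60             # elapsed time in seconds
  agg     = size_mb / secs      # aggregate write rate, MB/s
  printf "aggregate: %.1f MB/s, per node: %.2f MB/s\n", agg, agg / 10
}'
```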

The table 'store' was previously created in the hbase shell with 10 pre-split regions; since the
first column used as the rowkey is a sequence number, I can evenly break the whole table into 10 regions.
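Since HBase compares rowkeys byte-lexicographically, sequence-number keys generally need to be zero-padded for the 10 regions to come out evenly sized. A sketch that emits the 9 split points for 10 regions over 50M rows (the padding width of 8 digits is an assumption about the key format, not something stated in this thread):

```shell
# Emit 9 split points (10 regions) for zero-padded sequence keys 1..50,000,000.
seq 5000000 5000000 45000000 | awk '{ printf "%08d\n", $1 }'
```

These values can then be pasted into an hbase shell create statement, e.g. create 'store', 'f', SPLITS => ['05000000', ..., '45000000'].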

That is it.

Now I have removed the importtsv.bulk.output option; it has run for 1 hour but only finished 30%, so
I am sure it is slower. And since there is no 'output' dir holding the HFiles, I don't need
to invoke LoadIncrementalHFiles in this case. Will check tomorrow; it is too late here.
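Extrapolating that direct-write run: 30% done after 1 hour projects to about 200 minutes total, roughly 5x the bulkload path. This is a simple linear extrapolation, assuming the write rate stays constant for the remaining 70%:

```shell
# Project total time from 30% progress after 60 minutes, vs the 40-min bulkload.
awk 'BEGIN {
  total = 60 / 0.30                # minutes to finish at the observed rate
  printf "projected: %.0f min, %.0fx the 40-min bulkload\n", total, total / 40
}'
```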

Also, I want to ask the community whether anyone has previous experience loading 100G
or more of data into HBase: how did you do it, and what is a typical loading speed?
As a developer, I don't have any real project experience; I just run experiments in our lab.
It looks too slow to me, but maybe that is a normal loading speed... So I want to hear from
experts in this community.


From: Ted Yu [mailto:yuzhihong@gmail.com]
Sent: February 6, 2016 21:47
To: user@hbase.apache.org
Subject: Re: what is a good way to bulkload large amount of data into HBase table

Can you describe how you used importtsv ?
Here is one related command line parameter:

      "By default importtsv will load data directly into HBase. To instead generate\n" +
      "HFiles of data to prepare for a bulk data load, pass the option:\n" +
      "  -D" + BULK_OUTPUT_CONF_KEY + "=/path/for/output\n" +
      "  Note: if you do not use this option, then the target table must already exist in HBase\n" +

See also http://hbase.apache.org/book.html#arch.bulk.load.complete


On Sat, Feb 6, 2016 at 12:29 AM, Liu, Ming (Ming) <ming.liu@esgyn.cn> wrote:

> Hello,
> I am trying to find a good way to import a large amount of data into
> HBase from HDFS. I have a csv file about 135G in size. I put it
> into HDFS, then used HBase's importtsv utility to do a bulkload; for
> that 135G of original data, it took 40 mins. I have 10 nodes, each with
> 128G, all disks are SSD, and a 10G network. So this speed is not very good
> in my humble opinion, since it took only 10 mins for me to put that 135G of data into HDFS.
> I assume Hive will be much faster; for an external table, it even takes
> no time to load. I will test it later.
> So I want to ask for help: does anyone have better ideas for doing
> bulkload in HBase? Or is importtsv already the best tool for
> bulkload in the HBase world?
> If I have real big data (say > 50T), this does not seem like a practical
> loading speed, does it? Or is it? In practice, how do people normally load data
> into HBase?
> Thanks in advance,
> Ming