hbase-user mailing list archives

From Samuru Jackson <samurujack...@googlemail.com>
Subject Re: HBase performance & bulk load
Date Fri, 23 Jul 2010 13:57:03 GMT

For testing purposes I have to make some bulk loads as well.

What I do is insert the data in batches (for instance, 10,000 rows at a time).

I create a Put List out of those records:

List<Put> pList = new ArrayList<Put>();

where each Put has writeToWAL set to false.


Then I set autoflush to false and use a larger write buffer.


These settings boosted my load performance about fivefold: with no major
performance tuning and no special hardware configuration, I achieve
8,000-9,000 records per second.
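A minimal sketch of the settings described above, using the HBase 0.20-era client API that was current at the time of this thread. The table name "mytable", family "cf", qualifier "q", and buffer size are placeholders, not values from the original mail:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BulkPutSketch {
    public static void main(String[] args) throws IOException {
        HTable table = new HTable(new HBaseConfiguration(), "mytable"); // placeholder table
        table.setAutoFlush(false);                  // buffer puts client-side
        table.setWriteBufferSize(12 * 1024 * 1024); // e.g. 12 MB instead of the 2 MB default

        List<Put> pList = new ArrayList<Put>();
        for (int i = 0; i < 10000; i++) {           // one 10,000-row batch
            Put p = new Put(Bytes.toBytes("row-" + i));
            p.setWriteToWAL(false);                 // faster, but rows are lost if a regionserver dies
            p.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value-" + i));
            pList.add(p);
        }
        table.put(pList);     // fills the client-side write buffer
        table.flushCommits(); // pushes anything still buffered to the regionservers
    }
}
```

Note that skipping the WAL trades durability for speed, which is usually acceptable for a re-runnable bulk load but not for live writes.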


On Thu, Jul 22, 2010 at 6:31 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
> Yes, then you should really look at using the write buffer.
> J-D
> On Thu, Jul 22, 2010 at 3:22 PM, HAN LIU <hanl1@andrew.cmu.edu> wrote:
>> Thanks J-D.
>> The only place where I create an HTable is in the constructor of my Mapper. The
constructor is called only once for each map task, right?
>> Han
>> On Jul 22, 2010, at 4:43 PM, Jean-Daniel Cryans wrote:
>>> Han,
>>> This is bad, you must be doing something slow like creating a new
>>> HTable for each put call. Also, you need to use the write buffer
>>> (disable auto flushing, then set the write buffer size on HTable
>>> during the map configuration) since you manage the HTable yourself.
>>> The bulk load tool is widely used; you should give it a try if
>>> you only have 1 family.
>>> J-D
>>> On Thu, Jul 22, 2010 at 1:06 PM, HAN LIU <hanl1@andrew.cmu.edu> wrote:
>>>> Hi Guys,
>>>> I've been doing some data insertion from HDFS to HBase and the performance
seems to be really bad. It took about 3 hours to insert 15 GB of data. The mapreduce job
is launched from one machine which grabs data from HDFS and inserts it into an HTable located
on 3 other machines (1 master and 2 regionservers). There are 17 map jobs in total (no reduce
jobs), representing 17 files each about 1 GB in size. The mapper simply extracts the useful
information from each of these files and inserts it into HBase. In the end there are about
22 million rows added to the table, and with my implementation (pretty inefficient, I think),
'table.put(Put p)' is called once for each of these rows, so in the end there are
22 million 'table.put()' calls.
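For context, the arithmetic behind the numbers in this thread (nothing HBase-specific, just a sanity check):

```java
public class ThroughputCheck {
    // Integer rows-per-second, e.g. 22,000,000 rows over 3 hours.
    static long rowsPerSecond(long rows, long seconds) {
        return rows / seconds;
    }

    public static void main(String[] args) {
        // 22M puts in 3 hours is only about 2,037 puts/sec...
        System.out.println(rowsPerSecond(22_000_000L, 3L * 3600));
        // ...while at the 8,000 puts/sec reported elsewhere in the thread,
        // the same 22M rows would finish in roughly 45 minutes.
        System.out.println((22_000_000L / 8000) / 60);
    }
}
```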
>>>> Does it make sense that this many 'table.put' calls take 3 hours? I have
played with my code and determined that the bottleneck is these 'table.put()' calls:
if I remove them, the rest of the code (everything except committing the updates via
'table.put()') only takes 2 minutes to run. I am really inexperienced
with HBase, so how do you guys usually do data insertion? What could be the tricks to enhance
performance?
>>>> I am thinking about using the bulk load feature to batch insert data into
HBase. Is this a popular method out there in the HBase community?
>>>> Really sorry for asking so much help with my problems while not yet helping other
people with theirs. I would really like to offer help once I get more experienced with HBase.
>>>> Thanks a lot in advance :)
>>>> ----
>>>> Han Liu
>>>> SCS & HCI Institute
>>>> Undergrad. Class of 2012
>>>> Carnegie Mellon University
>> Han Liu
>> SCS & HCI Institute
>> Undergrad. Class of 2012
>> Carnegie Mellon University
