hbase-user mailing list archives

From Adam Phelps <...@opendns.com>
Subject Re: Speeding up LoadIncrementalHFiles?
Date Thu, 31 Mar 2011 18:14:53 GMT
On 3/30/11 8:39 PM, Stack wrote:
> What is slow?  The running of the LoadIncrementHFiles or the copy?

It's the LoadIncrementalHFiles portion.

> If
> the former, is it because the table its loading into has different
> boundaries than those of the HFiles so the HFiles have to be split?

I'm sure that could be one aspect of it. However, from the logs it looks
like <1% of the HFiles we're loading have to be split.  Looking at the
code for LoadIncrementalHFiles (HBase v0.90.1), I'm actually thinking our
problem is that this code loads the HFiles sequentially.  Our largest
table has over 2500 regions and the data being loaded is fairly well
distributed across them, so there end up being around 2500 HFiles for
each load period.  At 1-2 seconds per HFile, that makes the loading
process very time-consuming.

On the primary cluster (16 regionservers) one of these sets of HFiles
loads in ~350s vs ~3200s on the backup (with 4 regionservers).  Overall
the nodes on the backup cluster are running at around 5% CPU (and
similarly minimal disk and network usage).  So we have plenty of
resources to throw at the problem; it's just a matter of determining what
we can do here other than adding additional nodes to the cluster.

My first thought is to add some parallelism, either by splitting the
HFiles into multiple chunks for separate load instances, or by changing
LoadIncrementalHFiles itself to use multiple loading threads.
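
To make the multi-threaded option concrete, here is a rough sketch of what
I have in mind. It is not the LoadIncrementalHFiles code: the class name
ParallelLoadSketch, the loadAll method and the loadOne callback are all
made up for illustration, and loadOne stands in for whatever per-HFile load
call the tool makes today. It also assumes the per-HFile loads are
independent and safe to run concurrently, which I have not verified.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

public class ParallelLoadSketch {

    /**
     * Runs loadOne for every path on a fixed-size thread pool and returns
     * the paths whose load threw an exception, in input order.
     * loadOne is a hypothetical stand-in for the real per-HFile load step.
     */
    static List<String> loadAll(List<String> hfilePaths, int threads,
                                Consumer<String> loadOne)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per HFile; the pool bounds the concurrency.
            List<Future<?>> futures = new ArrayList<>();
            for (String path : hfilePaths) {
                futures.add(pool.submit(() -> loadOne.accept(path)));
            }
            // Wait for every task and record which ones failed.
            List<String> failed = new ArrayList<>();
            for (int i = 0; i < futures.size(); i++) {
                try {
                    futures.get(i).get();
                } catch (ExecutionException e) {
                    failed.add(hfilePaths.get(i));
                }
            }
            return failed;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Placeholder paths; each simulated load just sleeps briefly.
        List<String> paths = new ArrayList<>();
        for (int i = 0; i < 40; i++) {
            paths.add("hfile-" + i);
        }
        List<String> failed = loadAll(paths, 8, p -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        System.out.println("failed loads: " + failed.size());
    }
}
```

The pool size would need tuning against what the regionservers can absorb,
and any HFile that fails (or needs to be split) would still have to be
retried after the parallel pass.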

> Is your data only coming in via bulk load?

Yes, everything we put into HBase is via bulk load.  We found it to be a
huge improvement over doing individual Puts from the M/R jobs.

- Adam
