hbase-user mailing list archives

From Jonathan Gray <jl...@streamy.com>
Subject Re: Hbase and linear scaling with small write intensive clusters
Date Tue, 22 Sep 2009 23:59:37 GMT
Is there a reason you have the split size set to 2MB?  That's rather 
small and you'll end up constantly splitting, even once you have good 
distribution.

I'd go for pre-splitting, as others suggest, but with larger region sizes.
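
For reference, the split threshold is hbase.hregion.max.filesize in
hbase-site.xml (256MB by default in 0.20, if I recall), and it can also be
set per table from the client.  Below is a minimal sketch of creating the
"Inbox" table pre-split and with a larger region size.  It assumes your
client version has the createTable() overload that takes split keys; the
family name and key boundaries are just illustrative:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitInbox {
      public static void main(String[] args) throws Exception {
        HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());

        HTableDescriptor desc = new HTableDescriptor("Inbox");
        desc.addFamily(new HColumnDescriptor("items"));
        // Per-table region size: 256MB instead of 2MB keeps splits rare.
        desc.setMaxFileSize(256L * 1024 * 1024);

        // Carve up the key space in advance so writes spread across the
        // cluster from the first insert.  Boundaries assume zero-padded
        // sequential keys for a ~1,000,000-row load.
        byte[][] splitKeys = new byte[][] {
            Bytes.toBytes("0200000"), Bytes.toBytes("0400000"),
            Bytes.toBytes("0600000"), Bytes.toBytes("0800000"),
        };
        // Assumption: this overload exists in your client version.
        admin.createTable(desc, splitKeys);
      }
    }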

Ryan Rawson wrote:
> An interesting thing about HBase is that it really performs better with
> more data. Pre-splitting tables is one way to get that distribution up front.
> 
> Another performance bottleneck is the write-ahead log. You can disable
> it by calling:
> Put.setWriteToWAL(false);
> 
> and you will achieve a substantial speedup.
> 
> Good luck!
> -ryan
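
To make that concrete, here's roughly what the write path looks like with
the WAL off.  The row key, family, and qualifier names are made up, and keep
in mind that anything still only in the memstore is lost if a region server
dies, so this trades durability for speed:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class NoWalPut {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(new HBaseConfiguration(), "Inbox");

        Put put = new Put(Bytes.toBytes("0000042"));   // row key
        put.add(Bytes.toBytes("items"),                // family
                Bytes.toBytes("subject"),              // qualifier
                Bytes.toBytes("hello"));               // value
        put.setWriteToWAL(false);  // skip the WAL for write throughput
        table.put(put);
      }
    }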
> 
> On Tue, Sep 22, 2009 at 3:39 PM, stack <stack@duboce.net> wrote:
>> Split your table in advance?  You can do it from the UI or shell (Script
>> it?)
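
If you do script it, the admin API can request splits as well.  A sketch
only: I haven't checked which versions carry the split(String) call, so
treat its availability as an assumption (the shell's split command, where
present, does the same thing):

    // Ask the master to split the table's regions.  Assumption: this
    // overload exists in your HBaseAdmin version.
    HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
    admin.split("Inbox");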
>>
>> Regarding the same performance for 10 nodes as for 5: how many regions are in
>> your table?  What happens if you pile on more data?
>>
>> The split algorithm will be sped up in coming versions for sure.  Two
>> minutes seems like a long time.   Is it under load at this time?
>>
>> St.Ack
>>
>>
>>
>> On Tue, Sep 22, 2009 at 3:14 PM, Molinari, Guy <Guy.Molinari@disney.com> wrote:
>>
>>> Hello all,
>>>
>>>     I've been working with HBase for the past few months on a proof of
>>> concept/technology adoption evaluation.    I wanted to describe my
>>> scenario to the user/development community to get some input on my
>>> observations.
>>>
>>>
>>>
>>> I've written an application that is comprised of two tables.  It models
>>> a classic many-to-many relationship.   One table stores "User" data and
>>> the other represents an "Inbox" of items assigned to that user.    The
>>> key for the user is a string generated by the JDK's UUID.randomUUID()
>>> method.   The key for the "Inbox" is a monotonically increasing value.
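
One thing worth flagging here: because HBase sorts rows by key, a
monotonically increasing key means every new "Inbox" row lands in the
table's last region, so the write load concentrates on a single region
server regardless of cluster size.  A sketch of the two key shapes as
described (helper names are hypothetical):

    import java.util.UUID;
    import java.util.concurrent.atomic.AtomicLong;

    public class Keys {
      private static final AtomicLong SEQ = new AtomicLong();

      // "User" row key: random, so writes scatter across regions.
      static String userKey() {
        return UUID.randomUUID().toString();
      }

      // "Inbox" row key: monotonically increasing, so every new row
      // sorts after the previous one and hits the final region.
      static String inboxKey() {
        return String.format("%010d", SEQ.incrementAndGet());
      }
    }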
>>>
>>>
>>>
>>> It works just fine.   I've reviewed the performance tuning info on the
>>> HBase wiki page.   The client application spins up 100 threads, each
>>> grabbing a range of keys (for the "Inbox").    The I/O mix is about
>>> 50/50 read/write.   The test client inserts 1,000,000 "Inbox" items and
>>> verifies the existence of a "User" (FK check).   It uses column families
>>> to maintain integrity of the relationships.
>>>
>>>
>>>
>>> I'm running versions 0.19.3 and 0.20.0.    The behavior is basically the
>>> same.   The cluster consists of 10 nodes.   I'm running my namenode and
>>> HBase master on one dedicated box.   The other 9 run datanodes/region
>>> servers.
>>>
>>>
>>>
>>> I'm seeing around 1,000 "Inbox" transactions per second (dividing the
>>> total count inserted by the total time for the batch).    The problem is that I
>>> get the same results with 5 nodes as with 10.    Not quite what I was
>>> expecting.
>>>
>>>
>>>
>>> The bottleneck seems to be the splitting algorithm.   I've set my
>>> region size to 2MB.   I can see that as the process moves forward, HBase
>>> pauses, redistributes the data, and splits regions.   It does this
>>> first for the "Inbox" table and then pauses again and redistributes the
>>> "User" table.    This pause can be quite long, often 2 minutes or
>>> more.
>>>
>>>
>>>
>>> Can the key ranges be pre-defined in advance to avoid this?   I
>>> would rather not burden application developers/DBAs with this.
>>> Perhaps the divvy algorithms could be sped up?   Any configuration
>>> recommendations?
>>>
>>>
>>>
>>> Thanks in advance,
>>>
>>> Guy
>>>
>>>
> 
