hbase-user mailing list archives

From: Ryan Rawson <ryano...@gmail.com>
Subject: Re: Hbase and linear scaling with small write intensive clusters
Date: Tue, 22 Sep 2009 23:09:41 GMT
An interesting thing about HBase is that it really performs better with
more data. Pre-splitting tables is one way to get regions spread across
the cluster up front, instead of waiting for enough data to trigger
splits.
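Here's a rough sketch of forcing splits from a client, assuming the
0.20-era HBaseAdmin API (I'm reusing your "Inbox" table name; the class
name is just a placeholder):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class ForceSplits {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
    // Ask for the table's regions to be split. Doing this between
    // rounds of loading spreads regions across the region servers
    // before the write-heavy phase starts.
    admin.split("Inbox");
  }
}

The shell's 'split' command does the same thing.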

Another performance bottleneck is the write-ahead log (WAL). You can
disable it per edit by calling:
Put.setWriteToWAL(false);

and you will achieve a substantial speedup.
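A minimal sketch (the "items" family, "subject" qualifier, and class
name are made up, not from your schema):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class NoWalPut {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "Inbox");
    Put put = new Put(Bytes.toBytes("row-000001"));
    put.add(Bytes.toBytes("items"), Bytes.toBytes("subject"),
        Bytes.toBytes("hello"));
    // Skip the write-ahead log for this edit. Much faster, but edits
    // still in the memstore are lost if the region server dies before
    // a flush.
    put.setWriteToWAL(false);
    table.put(put);
  }
}

The trade-off is durability, so only do this if you can replay the
load after a crash.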

Good luck!
-ryan

On Tue, Sep 22, 2009 at 3:39 PM, stack <stack@duboce.net> wrote:
> Split your table in advance?  You can do it from the UI or shell
> (script it?).
>
> Regards same performance for 10 nodes as for 5 nodes, how many regions in
> your table?  What happens if you pile on more data?
>
> The split algorithm will be sped up in coming versions for sure.  Two
> minutes seems like a long time.  Is it under load at this time?
>
> St.Ack
>
> On Tue, Sep 22, 2009 at 3:14 PM, Molinari, Guy <Guy.Molinari@disney.com> wrote:
>
>> Hello all,
>>
>>     I've been working with HBase for the past few months on a proof of
>> concept/technology adoption evaluation.    I wanted to describe my
>> scenario to the user/development community to get some input on my
>> observations.
>>
>> I've written an application that is comprised of two tables.  It models
>> a classic many-to-many relationship.   One table stores "User" data and
>> the other represents an "Inbox" of items assigned to that user.    The
>> key for the user is a string generated by the JDK's UUID.randomUUID()
>> method.   The key for the "Inbox" is a monotonically increasing value.
>>
>> It works just fine.  I've reviewed the performance tuning info on the
>> HBase wiki page.  The client application spins up 100 threads, each
>> grabbing a range of keys (for the "Inbox").  The I/O mix is about
>> 50/50 read/write.  The test client inserts 1,000,000 "Inbox" items and
>> verifies the existence of a "User" (FK check).  It uses column families
>> to maintain integrity of the relationships.
>>
>> I'm running versions 0.19.3 and 0.20.0.    The behavior is basically the
>> same.   The cluster consists of 10 nodes.   I'm running my namenode and
>> HBase master on one dedicated box.   The other 9 run datanodes/region
>> servers.
>>
>> I'm seeing around 1,000 "Inbox" transactions per second (dividing the
>> total count inserted by the total time for the batch).  The problem is
>> that I get the same results with 5 nodes as with 10.  Not quite what I
>> was expecting.
>>
>> The bottleneck seems to be the splitting algorithms.  I've set my
>> region size to 2 MB.  I can see that as the process moves forward,
>> HBase pauses, redistributes the data, and splits regions.  It does this
>> first for the "Inbox" table and then pauses again and redistributes the
>> "User" table.  This pause can be quite long, often 2 minutes or more.
>>
>> Can the key ranges be defined in advance to avoid this?  I would
>> rather not burden application developers/DBAs with this.  Perhaps the
>> divvy algorithms could be sped up?  Any configuration recommendations?
>>
>> Thanks in advance,
>>
>> Guy
>>
>
