hbase-user mailing list archives

From stack <st...@duboce.net>
Subject Re: Hbase and linear scaling with small write intensive clusters
Date Wed, 23 Sep 2009 00:17:13 GMT
(Funny, I read the 2MB as 2GB -- yeah, why so small Guy?)

On Tue, Sep 22, 2009 at 4:59 PM, Jonathan Gray <jlist@streamy.com> wrote:

> Is there a reason you have the split size set to 2MB?  That's rather small
> and you'll end up constantly splitting, even once you have good
> distribution.
>
> I'd go for pre-splitting, as others suggest, but with larger region sizes.
>
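To make the pre-split + larger regions suggestion concrete, here is a rough
sketch from client code.  Table/family names and split points are invented,
and the createTable(desc, splits) overload is from newer client builds than
0.20.0, so treat it as a sketch rather than something to paste in:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    // Pre-create the "Inbox" table with a bigger max region size and a set
    // of split points, so writes spread across region servers from the
    // start instead of waiting on splits.
    HBaseConfiguration conf = new HBaseConfiguration();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("Inbox");
    desc.addFamily(new HColumnDescriptor("items"));   // hypothetical family
    desc.setMaxFileSize(256L * 1024 * 1024);          // 256MB, not 2MB

    // One split point per region you want up front, covering the expected
    // range of the monotonically increasing Inbox keys.
    byte[][] splits = new byte[][] {
        Bytes.toBytes("000100000"),
        Bytes.toBytes("000200000"),
        Bytes.toBytes("000300000"),
        // ...
    };
    admin.createTable(desc, splits);   // overload from later releases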
> Ryan Rawson wrote:
>
>> An interesting thing about HBase is that it really performs better with
>> more data. Pre-splitting tables is one way to get that spread before the
>> data is there.
>>
>> Another performance bottleneck is the write-ahead log. You can disable
>> it per Put by calling:
>> Put.setWriteToWAL(false);
>>
>> and you will achieve a substantial speedup.
>>
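To make that concrete, a minimal sketch against the 0.20 client API (the
table, family and values are made up for the example):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    HTable table = new HTable(new HBaseConfiguration(), "Inbox");
    Put put = new Put(Bytes.toBytes("000123456"));        // example row key
    put.add(Bytes.toBytes("items"), Bytes.toBytes("subject"),
        Bytes.toBytes("hello"));
    put.setWriteToWAL(false);   // skip the write-ahead log: much faster, but
                                // edits since the last flush are lost if a
                                // region server dies
    table.put(put);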
>> Good luck!
>> -ryan
>>
>> On Tue, Sep 22, 2009 at 3:39 PM, stack <stack@duboce.net> wrote:
>>
>>> Split your table in advance?  You can do it from the UI or shell (Script
>>> it?)
>>>
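If you want to script the splits rather than click around the UI, something
like the below ought to do it.  I'm not certain HBaseAdmin.split(String) is
in 0.19.3, so fall back on the UI if it isn't:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    // Ask the region servers to split the existing tables; each region
    // picks its own midpoint.  (The split button in the web UI does the
    // same thing per region.)
    HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
    admin.split("Inbox");
    admin.split("User");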
>>> Regarding the same performance for 10 nodes as for 5, how many regions are in
>>> your table?  What happens if you pile on more data?
>>>
>>> The split algorithm will be sped up in coming versions for sure.  Two
>>> minutes seems like a long time.  Is the cluster under load at that time?
>>>
>>> St.Ack
>>>
>>>
>>>
>>> On Tue, Sep 22, 2009 at 3:14 PM, Molinari, Guy <Guy.Molinari@disney.com
>>> >wrote:
>>>
>>>  Hello all,
>>>>
>>>>    I've been working with HBase for the past few months on a proof of
>>>> concept/technology adoption evaluation.    I wanted to describe my
>>>> scenario to the user/development community to get some input on my
>>>> observations.
>>>>
>>>>
>>>>
>>>> I've written an application that is comprised of two tables.  It models
>>>> a classic many-to-many relationship.   One table stores "User" data and
>>>> the other represents an "Inbox" of items assigned to that user.    The
>>>> key for the user is a string generated by the JDK's UUID.randomUUID()
>>>> method.   The key for the "Inbox" is a monotonically increasing value.
>>>>
>>>>
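(If I'm reading your key design right, it amounts to roughly the below --
just my guess at the code, and the nextSequence() helper is hypothetical.
Sequential keys like the Inbox one all sort to the tail of the table, so new
writes keep landing on whichever region holds the end of the key space:)

    // Guess at the key construction described above (Bytes is
    // org.apache.hadoop.hbase.util.Bytes):
    String userKey = java.util.UUID.randomUUID().toString();  // random, spreads well
    long seq = nextSequence();                                 // hypothetical counter
    byte[] inboxKey = Bytes.toBytes(String.format("%012d", seq)); // monotonically increasing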
>>>>
>>>> It works just fine.   I've reviewed the performance tuning info on the
>>>> HBase WIKI page.   The client application spins up 100 threads each one
>>>> grabbing a range of keys (for the "Inbox").    The I/O mix is about
>>>> 50/50 read/write.   The test client inserts 1,000,000 "Inbox" items and
>>>> verifies the existence of a "User" (FK check).   It uses column families
>>>> to maintain integrity of the relationships.
>>>>
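(For reference, I take it each worker thread is doing roughly the below per
item -- the names and the column family are invented, not necessarily your
actual schema:)

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    HBaseConfiguration conf = new HBaseConfiguration();
    HTable users = new HTable(conf, "User");
    HTable inbox = new HTable(conf, "Inbox");

    // Per item: read the User row (the "FK" check), then write the Inbox
    // row.  userKey/inboxKey are as in the sketch above.
    Get get = new Get(Bytes.toBytes(userKey));
    if (!users.get(get).isEmpty()) {
        Put put = new Put(inboxKey);
        put.add(Bytes.toBytes("items"), Bytes.toBytes("user"),
            Bytes.toBytes(userKey));
        inbox.put(put);
    }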
>>>>
>>>>
>>>> I'm running versions 0.19.3 and 0.20.0.    The behavior is basically the
>>>> same.   The cluster consists of 10 nodes.   I'm running my namenode and
>>>> HBase master on one dedicated box.   The other 9 run datanodes/region
>>>> servers.
>>>>
>>>>
>>>>
>>>> I'm seeing around ~1000 "Inbox" transactions per second (total count
>>>> inserted divided by total time for the batch).    The problem is that I
>>>> get the same results with 5 nodes as with 10.    Not quite what I was
>>>> expecting.
>>>>
>>>>
>>>>
>>>> The bottleneck seems to be the splitting algorithms.   I've set my
>>>> region size to 2MB.   I can see that as the process moves forward, HBase
>>>> pauses and re-distributes the data and splits regions.   It does this
>>>> first for the "Inbox" table and then pauses again and redistributes the
>>>> "User" table.    This pause can be quite long.   Often 2 minutes or
>>>> more.
>>>>
>>>>
>>>>
>>>> Can the key ranges be pre-defined somehow to avoid this?   I would
>>>> rather not burden application developers/DBAs with this.
>>>> Perhaps the divvy algorithms could be sped up?   Any configuration
>>>> recommendations?
>>>>
>>>>
>>>>
>>>> Thanks in advance,
>>>>
>>>> Guy
>>>>
>>>>
>>>>
>>
