Subject: Re: Hbase and linear scaling with small write intensive clusters
From: stack <saint.ack@gmail.com>
To: hbase-user@hadoop.apache.org
Date: Tue, 22 Sep 2009 17:17:13 -0700

(Funny, I read the 2MB as 2GB -- yeah, why so small Guy?)

On Tue, Sep 22, 2009 at 4:59 PM, Jonathan Gray wrote:

> Is there a reason you have the split size set to 2MB? That's rather small
> and you'll end up constantly splitting, even once you have good
> distribution.
>
> I'd go for pre-splitting, as others suggest, but with larger region sizes.
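For what it's worth, the pre-split can be scripted rather than done by hand
from the UI or shell. Below is a minimal sketch in Java, written against a
later HBase client API than the 0.19/0.20 releases discussed here (the
createTable(desc, splitKeys) overload and HBaseConfiguration.create() arrived
in later releases); the table name "Inbox", family "items", and the split
points are illustrative only:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PresplitInbox {
      public static void main(String[] args) throws Exception {
        HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());

        HTableDescriptor desc = new HTableDescriptor("Inbox");
        desc.addFamily(new HColumnDescriptor("items"));

        // Illustrative split points: ten regions, one per 100,000 keys for
        // the 1,000,000-row test run described below. The points have to
        // match the real key format (here assumed zero-padded to 7 digits)
        // or writes will still pile into a single region.
        byte[][] splitKeys = new byte[9][];
        for (int i = 1; i <= 9; i++) {
          splitKeys[i - 1] = Bytes.toBytes(String.format("%07d", i * 100000));
        }

        // Create the table with the regions already laid out so writes
        // spread across region servers from the start instead of waiting
        // for splits to happen under load.
        admin.createTable(desc, splitKeys);
        admin.close();
      }
    }

Because the "Inbox" key is monotonically increasing, pre-splitting only helps
if the writers really do cover the whole key range at once (as the 100-thread
client described below does); otherwise a salted or reversed key prefix
spreads the write load better.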
>
> Ryan Rawson wrote:
>
>> An interesting thing about HBase is that it really performs better with
>> more data. Pre-splitting tables is one way to get there.
>>
>> Another performance bottleneck is the write-ahead log. You can disable
>> it per write by calling:
>>
>>   Put.setWriteToWAL(false);
>>
>> and you will see a substantial speedup.
>>
>> Good luck!
>> -ryan
>>
>> On Tue, Sep 22, 2009 at 3:39 PM, stack wrote:
>>
>>> Split your table in advance? You can do it from the UI or the shell
>>> (script it?).
>>>
>>> Regarding getting the same performance for 10 nodes as for 5: how many
>>> regions are in your table? What happens if you pile on more data?
>>>
>>> The split algorithm will be sped up in coming versions for sure. Two
>>> minutes seems like a long time. Is the cluster under load at that time?
>>>
>>> St.Ack
>>>
>>> On Tue, Sep 22, 2009 at 3:14 PM, Molinari, Guy wrote:
>>>
>>>> Hello all,
>>>>
>>>> I've been working with HBase for the past few months on a proof of
>>>> concept/technology adoption evaluation. I wanted to describe my
>>>> scenario to the user/development community to get some input on my
>>>> observations.
>>>>
>>>> I've written an application that comprises two tables. It models a
>>>> classic many-to-many relationship. One table stores "User" data and
>>>> the other represents an "Inbox" of items assigned to that user. The
>>>> key for the user is a string generated by the JDK's UUID.randomUUID()
>>>> method. The key for the "Inbox" is a monotonically increasing value.
>>>>
>>>> It works just fine. I've reviewed the performance tuning info on the
>>>> HBase wiki page. The client application spins up 100 threads, each one
>>>> grabbing a range of keys (for the "Inbox"). The I/O mix is about
>>>> 50/50 read/write. The test client inserts 1,000,000 "Inbox" items and
>>>> verifies the existence of a "User" (FK check). It uses column families
>>>> to maintain the integrity of the relationships.
>>>>
>>>> I'm running versions 0.19.3 and 0.20.0. The behavior is basically the
>>>> same. The cluster consists of 10 nodes. I'm running my namenode and
>>>> HBase master on one dedicated box. The other 9 run datanodes/region
>>>> servers.
>>>>
>>>> I'm seeing around 1,000 "Inbox" transactions per second (total count
>>>> inserted divided by total time for the batch). The problem is that I
>>>> get the same results with 5 nodes as with 10. Not quite what I was
>>>> expecting.
>>>>
>>>> The bottleneck seems to be the splitting algorithms. I've set my
>>>> region size to 2MB. I can see that as the process moves forward, HBase
>>>> pauses, redistributes the data, and splits regions. It does this
>>>> first for the "Inbox" table and then pauses again and redistributes the
>>>> "User" table. This pause can be quite long, often 2 minutes or more.
>>>>
>>>> Can the key ranges be pre-defined somehow in advance to avoid this? I
>>>> would rather not burden application developers/DBAs with this.
>>>> Perhaps the divvy algorithms could be sped up? Any configuration
>>>> recommendations?
>>>>
>>>> Thanks in advance,
>>>>
>>>> Guy
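And a minimal sketch of the setWriteToWAL() call Ryan mentions above, again
written against a later client API than 0.19/0.20 (HBaseConfiguration.create()
and this HTable constructor came later); the table name "Inbox", family
"items", qualifier "msg", and the row key are illustrative. Note that skipping
the WAL trades durability for speed: edits not yet flushed from the memstore
are lost if a region server dies.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FastInboxWrite {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "Inbox");

        Put put = new Put(Bytes.toBytes("0000123"));             // monotonically increasing Inbox key
        put.add(Bytes.toBytes("items"), Bytes.toBytes("msg"),    // family:qualifier
                Bytes.toBytes("hello"));

        // Skip the write-ahead log for this Put. Substantially faster,
        // but the edit is gone if the region server crashes before the
        // memstore is flushed.
        put.setWriteToWAL(false);

        table.put(put);
        table.close();
      }
    }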