Subject: Re: Hbase and linear scaling with small write intensive clusters
From: stack <saint.ack@gmail.com>
To: hbase-user@hadoop.apache.org
Date: Tue, 22 Sep 2009 17:17:13 -0700

(Funny, I read the 2MB as 2GB -- yeah, why so small Guy?)

On Tue, Sep 22, 2009 at 4:59 PM, Jonathan Gray wrote:

> Is there a reason you have the split size set to 2MB? That's rather small
> and you'll end up constantly splitting, even once you have good
> distribution.
>
> I'd go for pre-splitting, as others suggest, but with larger region sizes.
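For what it's worth, the pre-split can be scripted rather than done by hand
from the UI or shell. Below is a minimal sketch in Java, written against a
later HBase client API than the 0.19/0.20 releases discussed here (the
createTable(desc, splitKeys) overload and HBaseConfiguration.create() arrived
in later releases); the table name "Inbox", family "items", and the split
points are illustrative only:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PresplitInbox {
      public static void main(String[] args) throws Exception {
        HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());

        HTableDescriptor desc = new HTableDescriptor("Inbox");
        desc.addFamily(new HColumnDescriptor("items"));

        // Illustrative split points: ten regions, one per 100,000 keys for
        // the 1,000,000-row test run described below. The points have to
        // match the real key format (here assumed zero-padded to 7 digits)
        // or writes will still pile into a single region.
        byte[][] splitKeys = new byte[9][];
        for (int i = 1; i <= 9; i++) {
          splitKeys[i - 1] = Bytes.toBytes(String.format("%07d", i * 100000));
        }

        // Create the table with the regions already laid out so writes
        // spread across region servers from the start instead of waiting
        // for splits to happen under load.
        admin.createTable(desc, splitKeys);
        admin.close();
      }
    }

Because the "Inbox" key is monotonically increasing, pre-splitting only helps
if the writers really do cover the whole key range at once (as the 100-thread
client described below does); otherwise a salted or reversed key prefix
spreads the write load better.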
>
> Ryan Rawson wrote:
>
>> An interesting thing about HBase is that it really performs better with
>> more data. Pre-splitting tables is one way to get there.
>>
>> Another performance bottleneck is the write-ahead log. You can disable
>> it per write by calling:
>>
>>   Put.setWriteToWAL(false);
>>
>> and you will see a substantial speedup.
>>
>> Good luck!
>> -ryan
>>
>> On Tue, Sep 22, 2009 at 3:39 PM, stack wrote:
>>
>>> Split your table in advance? You can do it from the UI or the shell
>>> (script it?).
>>>
>>> Regarding getting the same performance for 10 nodes as for 5: how many
>>> regions are in your table? What happens if you pile on more data?
>>>
>>> The split algorithm will be sped up in coming versions for sure. Two
>>> minutes seems like a long time. Is the cluster under load at that time?
>>>
>>> St.Ack
>>>
>>> On Tue, Sep 22, 2009 at 3:14 PM, Molinari, Guy wrote:
>>>
>>>> Hello all,
>>>>
>>>> I've been working with HBase for the past few months on a proof of
>>>> concept/technology adoption evaluation. I wanted to describe my
>>>> scenario to the user/development community to get some input on my
>>>> observations.
>>>>
>>>> I've written an application that comprises two tables. It models a
>>>> classic many-to-many relationship. One table stores "User" data and
>>>> the other represents an "Inbox" of items assigned to that user. The
>>>> key for the user is a string generated by the JDK's UUID.randomUUID()
>>>> method. The key for the "Inbox" is a monotonically increasing value.
>>>>
>>>> It works just fine. I've reviewed the performance tuning info on the
>>>> HBase wiki page. The client application spins up 100 threads, each one
>>>> grabbing a range of keys (for the "Inbox"). The I/O mix is about
>>>> 50/50 read/write. The test client inserts 1,000,000 "Inbox" items and
>>>> verifies the existence of a "User" (FK check). It uses column families
>>>> to maintain the integrity of the relationships.
>>>>
>>>> I'm running versions 0.19.3 and 0.20.0. The behavior is basically the
>>>> same. The cluster consists of 10 nodes. I'm running my namenode and
>>>> HBase master on one dedicated box. The other 9 run datanodes/region
>>>> servers.
>>>>
>>>> I'm seeing around 1,000 "Inbox" transactions per second (total count
>>>> inserted divided by total time for the batch). The problem is that I
>>>> get the same results with 5 nodes as with 10. Not quite what I was
>>>> expecting.
>>>>
>>>> The bottleneck seems to be the splitting algorithms. I've set my
>>>> region size to 2MB. I can see that as the process moves forward, HBase
>>>> pauses, redistributes the data, and splits regions. It does this
>>>> first for the "Inbox" table and then pauses again and redistributes the
>>>> "User" table. This pause can be quite long, often 2 minutes or more.
>>>>
>>>> Can the key ranges be pre-defined somehow in advance to avoid this? I
>>>> would rather not burden application developers/DBAs with this.
>>>> Perhaps the divvy algorithms could be sped up? Any configuration
>>>> recommendations?
>>>>
>>>> Thanks in advance,
>>>>
>>>> Guy
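And a minimal sketch of the setWriteToWAL() call Ryan mentions above, again
written against a later client API than 0.19/0.20 (HBaseConfiguration.create()
and this HTable constructor came later); the table name "Inbox", family
"items", qualifier "msg", and the row key are illustrative. Note that skipping
the WAL trades durability for speed: edits not yet flushed from the memstore
are lost if a region server dies.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FastInboxWrite {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "Inbox");

        Put put = new Put(Bytes.toBytes("0000123"));             // monotonically increasing Inbox key
        put.add(Bytes.toBytes("items"), Bytes.toBytes("msg"),    // family:qualifier
                Bytes.toBytes("hello"));

        // Skip the write-ahead log for this Put. Substantially faster,
        // but the edit is gone if the region server crashes before the
        // memstore is flushed.
        put.setWriteToWAL(false);

        table.put(put);
        table.close();
      }
    }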