Subject: Re: Review Request: Fix RPC deadlock when splitting regions on same RS as meta under heavy load
From: stack@duboce.net
To: stack@duboce.net
Cc: "Todd Lipcon", jiraposter@review.hbase.org, dev@hbase.apache.org
Reply-To: dev@hbase.apache.org
Date: Wed, 08 Sep 2010 16:46:40 -0000

> On 2010-09-07 18:33:16, Todd Lipcon wrote:
> > src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java, line 207
> >
> >     maybe now we can do an:
> >
> >     assert !this.parent.lock.writeLock().isHeldByCurrentThread() : "Unsafe to hold write lock while performing RPCs";

I'll add this assert.

- stack


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/798/#review1122
-----------------------------------------------------------


On 2010-09-07 13:38:39, Todd Lipcon wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> http://review.cloudera.org/r/798/
> -----------------------------------------------------------
>
> (Updated 2010-09-07 13:38:39)
>
>
> Review request for hbase and stack.
>
>
> Summary
> -------
>
> Moves all RPCs outside of the region writeLock; the writeLock is now only held long enough to set the 'closing' flag. When we drop the lock, any waiters will see 'closing' upon acquiring the lock, and thus throw NSRE (NotServingRegionException).
>
> If we abort the split, the region is reopened as before. Accessors will have received an NSRE but will eventually come back to the same region.
>
>
> This addresses bug HBASE-2964.
> http://issues.apache.org/jira/browse/HBASE-2964
>
>
> Diffs
> -----
>
>   src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java a692125
>   src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java 3507c0d
>   src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitTransaction.java a245d97
>
> Diff: http://review.cloudera.org/r/798/diff
>
>
> Testing
> -------
>
> YCSB testing on my cluster: it used to deadlock due to this bug within an hour. I ran a 5-hour load test overnight and it worked OK.
>
>
> Thanks,
>
> Todd
>
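
The locking change described in the summary can be sketched roughly as follows. This is a minimal, hypothetical Java model, not the actual HRegion/SplitTransaction code from the patch: `RegionSketch`, `split`, `abortSplit` and `NotServingException` (a stand-in for HBase's `NotServingRegionException`) are invented for illustration, and a `Runnable` stands in for the RPCs the split makes. It shows the write lock held only long enough to set `closing`, the assert from the review placed before the RPCs, and a reader that arrives afterwards failing fast instead of blocking.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for HBase's NotServingRegionException (NSRE).
class NotServingException extends RuntimeException {
    NotServingException(String msg) { super(msg); }
}

// Hypothetical region model: client operations share the read lock; the
// split path takes the write lock only to flip the 'closing' flag.
class RegionSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Map<String, String> data = new ConcurrentHashMap<>();
    private volatile boolean closing = false;

    void put(String key, String value) {
        lock.readLock().lock();
        try {
            checkOpen();
            data.put(key, value);
        } finally {
            lock.readLock().unlock();
        }
    }

    String get(String key) {
        lock.readLock().lock();
        try {
            checkOpen();
            return data.get(key);
        } finally {
            lock.readLock().unlock();
        }
    }

    // A caller that acquires the lock after 'closing' was set fails fast.
    private void checkOpen() {
        if (closing) throw new NotServingException("region is closing");
    }

    // 'rpcs' is a hypothetical stand-in for the remote calls made by a split.
    void split(Runnable rpcs) {
        lock.writeLock().lock();
        try {
            closing = true;
        } finally {
            lock.writeLock().unlock();
        }
        assert !lock.writeLock().isHeldByCurrentThread()
            : "Unsafe to hold write lock while performing RPCs";
        rpcs.run(); // runs with no region lock held
    }

    // Rollback path: clear the flag so the region serves requests again.
    void abortSplit() {
        lock.writeLock().lock();
        try {
            closing = false;
        } finally {
            lock.writeLock().unlock();
        }
    }
}

class SplitLockDemo {
    public static void main(String[] args) {
        RegionSketch region = new RegionSketch();
        region.put("row1", "v1");
        System.out.println(region.get("row1")); // v1

        // The "RPC" starts another thread that touches the region and waits
        // for it. If split() still held the write lock, join() would hang.
        String[] seen = new String[1];
        region.split(() -> {
            Thread client = new Thread(() -> {
                try {
                    region.get("row1");
                    seen[0] = "served";
                } catch (NotServingException e) {
                    seen[0] = "NSRE";
                }
            });
            client.start();
            try {
                client.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        System.out.println(seen[0]); // NSRE
        region.abortSplit();
        System.out.println(region.get("row1")); // v1
    }
}
```

In the demo, the simulated RPC waits on a second thread that reads the region. If the RPC were made while the write lock was still held, that reader would block on the read lock forever, a much-simplified analogue of the deadlock named in the subject line; with the flag approach it gets the NSRE-like exception immediately and the split proceeds. Run with `-ea` to have the JVM actually evaluate the assert.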