Subject: Re: Review Request: Add separate handling of PENDING_OPEN/PENDING_CLOSE in timeout monitor and additional testing
From: stack@duboce.net
To: stack@duboce.net
Date: Tue, 26 Oct 2010 06:25:36 -0000
Message-ID: <20101026062536.785.63779@ip-10-202-7-187.ec2.internal>
Cc: "Jonathan Gray", jiraposter@review.hbase.org, dev@hbase.apache.org
Reply-To: dev@hbase.apache.org
In-Reply-To: <20101025232936.786.15925@ip-10-202-7-187.ec2.internal>
References:
 <20101025232936.786.15925@ip-10-202-7-187.ec2.internal>

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/1087/
-----------------------------------------------------------

(Updated 2010-10-25 23:25:36.390570)


Review request for hbase and stack.


Changes
-------

So, a few extra things after digging in w/ Jon.

1. A watch was not being called on .META. move because it was not being set; in MetaNodeTracker, nodeDeleted was not calling super to reset the watch. (In a rolling restart, only a few servers would actually see .META. move, and it was these that were hanging up. Others, when they came up, would see .META. in its new location.)

2. We were not assigning out .META. if the master had trouble reaching meta before it saw the server expire. When we had trouble contacting meta before we saw its server expire, we'd reset its location in the catalog tracker, and we were using the catalog tracker to determine which server was hosting meta. We use a different technique now.


Summary
-------

Adds new handling of the timeouts for PENDING_OPEN and PENDING_CLOSE in-memory master RIT states.

Adds some new broken RIT states into TestMasterFailover.

Some of these broken states don't seem possible to me, but as long as we aren't breaking the existing behaviors and tests, I think it's okay if we handle odd cases that can be mocked. Who knows what will happen in the real world.

The reason TestMasterFailover didn't/doesn't really test for the issue in HBASE-3147 is that this new broken condition happens when an RS dies / goes offline, rather than during a master failover concurrent w/ an RS failure.

v4 of the patch adds to Jon's fixes. It adds a shutdown server handler for root and another for meta so the processing of servers hosting meta/root does not get frozen out. I've seen this in my testing.


This addresses bug HBASE-3147.
    http://issues.apache.org/jira/browse/HBASE-3147


Diffs (updated)
-----

  trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java 1027351
  trunk/src/main/java/org/apache/hadoop/hbase/catalog/MetaReader.java 1027351
  trunk/src/main/java/org/apache/hadoop/hbase/executor/EventHandler.java 1027351
  trunk/src/main/java/org/apache/hadoop/hbase/executor/ExecutorService.java 1027351
  trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 1027351
  trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java 1027351
  trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java 1027351
  trunk/src/main/java/org/apache/hadoop/hbase/master/handler/MetaServerShutdownHandler.java PRE-CREATION
  trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java 1027351
  trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/MetaNodeTracker.java 1027351
  trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKAssign.java 1027351
  trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java 1027351

Diff: http://review.cloudera.org/r/1087/diff


Testing
-------

TestMasterFailover passes.


Thanks,

Jonathan
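
P.S. A note on item 1 under Changes. ZooKeeper watches are one-shot: once a watch fires it is gone until somebody sets it again. So an overriding nodeDeleted that skips the parent's re-arming logic sees the first move and nothing after it. Below is a minimal, self-contained sketch of that failure mode. It is not the HBase code and does not use the ZooKeeper client; OneShotWatches, NodeTracker, MetaTracker, the callSuper flag, and the path "/meta-location" are all hypothetical stand-ins made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a one-shot watch and a tracker that must re-arm it. All names are hypothetical. */
final class WatchRearmSketch {

    interface Listener {
        void nodeDeleted(String path);
    }

    /** Stand-in for the coordinator: a watch is consumed when it fires. */
    static final class OneShotWatches {
        private final Map<String, Listener> watches = new HashMap<>();

        void setWatch(String path, Listener listener) {
            watches.put(path, listener);
        }

        void deleteNode(String path) {
            // One-shot semantics: remove the watch before delivering the event.
            Listener listener = watches.remove(path);
            if (listener != null) {
                listener.nodeDeleted(path);
            }
        }
    }

    /** Base tracker: its nodeDeleted is the only place the watch gets re-armed. */
    static class NodeTracker implements Listener {
        protected final OneShotWatches zk;
        protected final String node;

        NodeTracker(OneShotWatches zk, String node) {
            this.zk = zk;
            this.node = node;
            zk.setWatch(node, this); // initial watch
        }

        @Override
        public void nodeDeleted(String path) {
            zk.setWatch(node, this); // reset the watch for the next event
        }
    }

    /** Subclass that reacts to a move; callSuper=false reproduces the missed re-arm. */
    static final class MetaTracker extends NodeTracker {
        private final boolean callSuper;
        int movesHandled = 0;

        MetaTracker(OneShotWatches zk, String node, boolean callSuper) {
            super(zk, node);
            this.callSuper = callSuper;
        }

        @Override
        public void nodeDeleted(String path) {
            if (callSuper) {
                super.nodeDeleted(path); // without this, later moves go unseen
            }
            movesHandled++;
        }
    }

    /** Deletes the node `moves` times and reports how many deletions the tracker noticed. */
    static int movesObserved(boolean callSuper, int moves) {
        OneShotWatches zk = new OneShotWatches();
        MetaTracker tracker = new MetaTracker(zk, "/meta-location", callSuper);
        for (int i = 0; i < moves; i++) {
            zk.deleteNode("/meta-location");
        }
        return tracker.movesHandled;
    }

    public static void main(String[] args) {
        System.out.println("with super call:    " + movesObserved(true, 3));
        System.out.println("without super call: " + movesObserved(false, 3));
    }
}
```

In this toy model the tracker that calls super notices all three deletions, while the one that skips it notices only the first, which matches the symptom described in item 1: the watch was simply never set again.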