Date: Thu, 22 Sep 2011 06:25:28 +0000 (UTC)
From: "jiraposter@reviews.apache.org (JIRA)"
To: issues@hbase.apache.org
Message-ID: <1883733240.1110.1316672728077.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <723160302.561.1316650886239.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (HBASE-4455) Rolling restart RSs scenario, -ROOT-, .META.
 regions are lost in AssignmentManager
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

[ https://issues.apache.org/jira/browse/HBASE-4455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112357#comment-13112357 ]

jiraposter@reviews.apache.org commented on HBASE-4455:
------------------------------------------------------

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2007/#review2019
-----------------------------------------------------------

looks overall pretty good - been meaning to do a similar change to split up shutdown processing but didn't find the time. A few nits below.

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java

typo: unavailable

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java

debug

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java

why catch this if you just rethrow it?

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java

should rename "splitHLog" to "shouldSplitHLog" -- since "splitHLog" could be read as past tense "alreadySplitHLog"

- Todd

On 2011-09-22 00:38:16, Ming Ma wrote:
bq.
bq. -----------------------------------------------------------
bq. This is an automatically generated e-mail. To reply, visit:
bq. https://reviews.apache.org/r/2007/
bq. -----------------------------------------------------------
bq.
bq. (Updated 2011-09-22 00:38:16)
bq.
bq.
bq. Review request for hbase.
bq.
bq.
bq. Summary
bq. -------
bq.
bq. 1. Add more logging.
bq. 2. Clean up CatalogTracker. waitForMeta waits for "timeout" value. When waitForMetaServerConnectionDefault is called by MetaNodeTracker, the timeout value is large. So it doesn't retry in case -ROOT- is updated; add the proper implementation for CatalogTracker.verifyMetaRegionLocation
bq. 4. Check for the latest -ROOT- and .META. region location during the handling of server shutdown.
bq. 5. Right after assigning the -ROOT- or .META. in ServerShutdownHandler, don't block and wait for .META. availability. Resubmit another ServerShutdownHandler for regular regions.
bq.
bq.
bq. This addresses bug HBASE-4455.
bq. https://issues.apache.org/jira/browse/HBASE-4455
bq.
bq.
bq. Diffs
bq. -----
bq.
bq. http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java 1172205
bq. http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java 1172205
bq. http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java 1172205
bq. http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/MetaNodeTracker.java 1172205
bq. http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/RootRegionTracker.java 1172205
bq. http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/OpenedRegionHandler.java 1172205
bq. http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java 1172205
bq. http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/MetaServerShutdownHandler.java 1172205
bq. http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java 1172205
bq. http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 1172205
bq. http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java 1172205
bq. http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperNodeTracker.java 1172205
bq.
bq. Diff: https://reviews.apache.org/r/2007/diff
bq.
bq.
bq. Testing
bq. -------
bq.
bq. Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds, etc. The program can run for a couple of hours until it stops. -ROOT- and .META. are available during that time.
bq.
bq.
bq. Thanks,
bq.
bq. Ming
bq.
bq.

> Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager
> ----------------------------------------------------------------------------------
>
> Key: HBASE-4455
> URL: https://issues.apache.org/jira/browse/HBASE-4455
> Project: HBase
> Issue Type: Bug
> Reporter: Ming Ma
> Assignee: Ming Ma
>
> Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds, etc. After a while, you will find the -ROOT-, .META. regions aren't in "regions in transition" from the AssignmentManager's point of view, but they aren't assigned to any region server. Here are the issues.
> 1. The -ROOT- or .META. location is stale when MetaServerShutdownHandler is invoked to check if it contains the -ROOT- region. That is due to the long delay of ZK notifications and the async nature of the system. Here is an example: even though the new root region server sea-lab-1,60020,1316380133656 is set at T2, at T3, when the shutdown of sea-lab-1,60020,1316380133656 is processed, the root location still points to the old server sea-lab-3,60020,1316380037898.
> T1: 2011-09-18 14:08:52,470 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-region-server and set watcher; sea-lab-3,60020,1316380037898
> T2: 2011-09-18 14:08:57,173 INFO org.apache.hadoop.hbase.catalog.RootLocationEditor: Setting ROOT region location in ZooKeeper as sea-lab-1,60020,1316380133656
> T3: 2011-09-18 14:10:26,393 DEBUG org.apache.hadoop.hbase.master.ServerManager: Added=sea-lab-1,60020,1316380133656 to dead servers, submitted shutdown handler to be executed, root=false, meta=true, current Root Location: sea-lab-3,60020,1316380037898
> T4: 2011-09-18 14:12:37,314 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-region-server and set watcher; sea-lab-1,60020,1316380133656
> 2. The MetaServerShutdownHandler worker thread that waits for -ROOT- or .META. availability could be blocked. If, meanwhile, the new server to which -ROOT- or .META. is being assigned restarts, another instance of MetaServerShutdownHandler is queued. Eventually, all MetaServerShutdownHandler worker threads are occupied. It looks like HBASE-4245.
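
For illustration only, here is a minimal sketch of the idea behind summary item 5 and issue 2 above: a handler that cannot make progress because .META. is not yet available returns to the queue instead of blocking its worker. This is not HBase code. Every name in it (ResubmitSketch, Task, drain, metaAvailable) is hypothetical, and the single-threaded queue is an assumed stand-in for the master's handler pool.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.BooleanSupplier;

public class ResubmitSketch {
    // Hypothetical shared state: whether .META. is currently assigned.
    static boolean metaAvailable = false;

    /** Hypothetical handler: tryRun() returns false when it cannot make progress yet. */
    static final class Task {
        final String name;
        final BooleanSupplier body;

        Task(String name, BooleanSupplier body) {
            this.name = name;
            this.body = body;
        }

        boolean tryRun() {
            return body.getAsBoolean();
        }
    }

    /**
     * Simulates one worker draining the queue. A task that is not ready is put
     * back at the tail, so the worker stays free for the tasks behind it.
     */
    static List<String> drain(Deque<Task> queue, int maxSteps) {
        List<String> log = new ArrayList<>();
        for (int step = 0; step < maxSteps && !queue.isEmpty(); step++) {
            Task t = queue.pollFirst();
            if (t.tryRun()) {
                log.add("done:" + t.name);
            } else {
                log.add("requeue:" + t.name);
                queue.addLast(t);
            }
        }
        return log;
    }

    static List<String> demo() {
        metaAvailable = false;
        Deque<Task> queue = new ArrayDeque<>();
        // The regular-region handler needs .META. and is queued first.
        queue.add(new Task("regular-regions", () -> metaAvailable));
        // The handler that assigns .META. sits behind it in the queue.
        queue.add(new Task("assign-meta", () -> {
            metaAvailable = true;
            return true;
        }));
        return drain(queue, 10);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With a single worker, a handler that blocked on .META. here would never let "assign-meta" run. With resubmission the run prints [requeue:regular-regions, done:assign-meta, done:regular-regions]. A real implementation would presumably also need a retry delay or retry limit, which this sketch only approximates with maxSteps.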