Date: Thu, 29 Dec 2011 19:49:32 +0000 (UTC)
From: "jiraposter@reviews.apache.org (Commented) (JIRA)"
To: issues@hbase.apache.org
Message-ID: <1717838476.52364.1325188172163.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <1529261440.47238.1325028270701.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

[ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177351#comment-13177351 ]

jiraposter@reviews.apache.org commented on HBASE-5099:
------------------------------------------------------

bq. On 2011-12-29 19:02:17, Ted Yu wrote:
bq. > src/main/java/org/apache/hadoop/hbase/master/HMaster.java, line 1475
bq. >
bq. > Timeout value should be included in the exception message.

It may not be because of a timeout.

bq. On 2011-12-29 19:02:17, Ted Yu wrote:
bq. > src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java, line 2
bq. >
bq. > Year is not needed.

I copied it from another test case. I can remove it.

bq. On 2011-12-29 19:02:17, Ted Yu wrote:
bq. > src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java, line 34
bq. >
bq. > This shouldn't be small test since mini cluster is involved.

Ok.

bq. On 2011-12-29 19:02:17, Ted Yu wrote:
bq. > src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java, line 41
bq. >
bq. > We shouldn't pass 1 here since that means 1 master.

It means 1 region server. Actually, I do want 1 master so it can recover itself. Otherwise, the backup master will take over and the active master doesn't have a chance to recover in this scenario.

bq. On 2011-12-29 19:02:17, Ted Yu wrote:
bq. > src/main/java/org/apache/hadoop/hbase/master/HMaster.java, line 1452
bq. >
bq. > Line is 88 chars long.
bq. > If ExecutorService is imported, the line should be much shorter.

We can't do this since there is already an ExecutorService (from hbase) imported. I can't use the hbase ExecutorService because it doesn't fit.

bq. On 2011-12-29 19:02:17, Ted Yu wrote:
bq. > src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java, line 50
bq. >
bq. > Since the test doesn't involve standby master, I think we should use a different name.

There is a test case called TestMasterFailover which involves a standby master. That's why I called it TestMasterRecovery. How about TestMasterZKSessionRecovery?

- Jimmy

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3323/#review4148
-----------------------------------------------------------

On 2011-12-29 18:38:54, Jimmy Xiang wrote:
bq.
bq. -----------------------------------------------------------
bq. This is an automatically generated e-mail. To reply, visit:
bq. https://reviews.apache.org/r/3323/
bq. -----------------------------------------------------------
bq.
bq. (Updated 2011-12-29 18:38:54)
bq.
bq. Review request for hbase, Ted Yu and Michael Stack.
bq.
bq. Summary
bq. -------
bq.
bq. Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread, time it out, and fail the recovery if it is stuck somewhere.
bq.
bq. I added a test to test the abort method. However, for the mini cluster, becomeActiveMaster() doesn't succeed, so the abort method always ends up aborting. So the actual successful recovery is not tested.
bq.
bq. This addresses bug HBASE-5099.
bq. https://issues.apache.org/jira/browse/HBASE-5099
bq.
bq. Diffs
bq. -----
bq.
bq. src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6
bq. src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java PRE-CREATION
bq.
bq. Diff: https://reviews.apache.org/r/3323/diff
bq.
bq. Testing
bq. -------
bq.
bq. mvn -PlocalTests -Dtest=TestMaster* clean test
bq.
bq. Thanks,
bq.
bq. Jimmy
bq.

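The approach in the Summary above (run the recovery in a separate thread, time it out, and fail the recovery if it is stuck) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the patch: the class RecoveryTimeoutSketch, the method recoverWithTimeout, and the Callable standing in for tryRecoveringExpiredZKSession() are all hypothetical names. It uses the java.util.concurrent ExecutorService, which is the class that, per the review thread, cannot simply be imported in HMaster because the HBase ExecutorService is already imported there.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; not the HMaster code from the patch.
class RecoveryTimeoutSketch {

  /**
   * Runs the given recovery step on its own thread and waits at most
   * timeoutMs for it. Returns true only if the step finished in time
   * and reported success; returns false on timeout or on failure.
   */
  static boolean recoverWithTimeout(Callable<Boolean> recovery, long timeoutMs) {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    Future<Boolean> result = executor.submit(recovery);
    try {
      return Boolean.TRUE.equals(result.get(timeoutMs, TimeUnit.MILLISECONDS));
    } catch (TimeoutException e) {
      result.cancel(true); // interrupt the stuck recovery thread
      return false;
    } catch (ExecutionException e) {
      return false; // the recovery failed for a reason other than the timeout
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the caller's interrupt status
      return false;
    } finally {
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) {
    // A recovery step that returns quickly.
    System.out.println(recoverWithTimeout(() -> true, 1000));
    // A recovery step that is stuck for longer than the timeout.
    System.out.println(recoverWithTimeout(() -> { Thread.sleep(5000); return true; }, 100));
  }
}
```

The separate catch blocks mirror the point made in the replies above: a failed recovery is not necessarily caused by the timeout, so a caller that builds an exception message may want to tell the two cases apart.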
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>         Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099.patch
>
>
> A RS died. The ServerShutdownHandler kicked in and started the log splitting. SplitLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not actually created. The requests were just queued.
> At this time, the zookeeper connection expired. HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created. However, this master became the
> new master again. It tried to assign root and meta.
> Because the dead RS hosted the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread. So the async create split task is never retried, since
> there is only one event thread, and it is waiting for the root region to be assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
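
The hang in the issue description comes down to one callback blocking the only event thread while the callback that would unblock it sits queued behind it. A minimal sketch of that pattern is below. It does not use ZooKeeper or HBase at all: a plain thread pool stands in for the event thread, a CountDownLatch stands in for "log splitting is complete", and the names SingleEventThreadSketch, rootAssignedInTime, waitForRoot, and retryCreate are hypothetical. The wait is bounded here only so that the example terminates; the wait described in the issue has no such bound.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the blocking pattern; no ZooKeeper or HBase code involved.
class SingleEventThreadSketch {

  /**
   * Returns true if the simulated "wait for root region" was released
   * within waitMs, given a pool of eventThreads callback threads.
   */
  static boolean rootAssignedInTime(int eventThreads, long waitMs) {
    ExecutorService events = Executors.newFixedThreadPool(eventThreads);
    CountDownLatch logSplitDone = new CountDownLatch(1);
    // Callback 1: session recovery, blocks until log splitting is done.
    Callable<Boolean> waitForRoot = () -> logSplitDone.await(waitMs, TimeUnit.MILLISECONDS);
    // Callback 2: the retried async create that lets log splitting finish.
    Runnable retryCreate = logSplitDone::countDown;
    try {
      Future<Boolean> recovery = events.submit(waitForRoot);
      events.submit(retryCreate); // queued behind callback 1 if there is one thread
      return recovery.get();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } catch (ExecutionException e) {
      return false;
    } finally {
      events.shutdownNow();
    }
  }

  public static void main(String[] args) {
    // One event thread: the retry cannot run until the wait gives up.
    System.out.println(rootAssignedInTime(1, 200));
    // Two threads: the retry runs alongside the wait and releases it.
    System.out.println(rootAssignedInTime(2, 2000));
  }
}
```

Running the blocking step off the event thread, or bounding it with a timeout as in the patch under review, are two ways to keep the queued callback from being starved.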