Date: Tue, 13 Mar 2012 20:50:41 +0000 (UTC)
From: "nkeywal (Updated) (JIRA)"
To: issues@hbase.apache.org
Message-ID: <1226683142.9409.1331671842005.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <376264361.7905.1331648858165.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Updated] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master

[
https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

nkeywal updated HBASE-5572:
---------------------------

    Attachment: 5572.v2.patch

> KeeperException.SessionExpiredException management could be improved in Master
> ------------------------------------------------------------------------------
>
> Key: HBASE-5572
> URL: https://issues.apache.org/jira/browse/HBASE-5572
> Project: HBase
> Issue Type: Improvement
> Components: master
> Affects Versions: 0.96.0
> Reporter: nkeywal
> Assignee: nkeywal
> Priority: Minor
> Fix For: 0.96.0
>
> Attachments: 5572.v1.patch, 5572.v2.patch, 5572.v2.patch, 5572.v2.patch
>
>
> Synthesis:
> 1) TestMasterZKSessionRecovery distinguishes two cases on SessionExpiredException. One is explicitly not managed. However, it seems that there is no reason for this.
> 2) The issue lies in ActiveMasterManager#blockUntilBecomingActiveMaster, a quite complex function, with a useless recursive call.
> 3) TestMasterZKSessionRecovery#testMasterZKSessionRecoverySuccess is equivalent to TestZooKeeper#testMasterSessionExpired
> 4) TestMasterZKSessionRecovery#testMasterZKSessionRecoveryFailure can be removed if we merge the two cases mentioned above.
> Changes are:
> 2) Changing ActiveMasterManager#blockUntilBecomingActiveMaster to have a single case and remove recursion
> 1) Removing TestMasterZKSessionRecovery
> Detailed justification:
> testMasterZKSessionRecoveryFailure says:
> {noformat}
>   /**
>    * Negative test of master recovery from zk session expiry.
>    *
>    * Starts with one master. Fakes the master zk session expired.
>    * Ensures the master cannot recover the expired zk session since
>    * the master zk node is still there.
>    */
>   public void testMasterZKSessionRecoveryFailure() throws Exception {
>     MiniHBaseCluster cluster = TEST_UTIL.getHBaseCluster();
>     HMaster m = cluster.getMaster();
>     m.abort("Test recovery from zk session expired",
>       new KeeperException.SessionExpiredException());
>     assertTrue(m.isStopped());
>   }
> {noformat}
> This test works, i.e. the assertion always holds.
> But do we really want this behavior?
> Looking at the code, what actually happens is strange:
> - HMaster#abort calls HMaster#abortNow. If HMaster#abortNow returns false, HMaster#abort stops the master.
> - HMaster#abortNow checks the exception type. As it's a SessionExpiredException, it tries to recover by calling HMaster#tryRecoveringExpiredZKSession. If it cannot, it returns false (which makes HMaster#abort stop the master).
> - HMaster#tryRecoveringExpiredZKSession recreates a ZooKeeperConnection and then tries to become the active master. If it cannot, it returns false (which makes HMaster#abort stop the master).
> - HMaster#becomeActiveMaster returns the result of ActiveMasterManager#blockUntilBecomingActiveMaster. blockUntilBecomingActiveMaster says it will return false if there is any error preventing it from becoming the active master.
> - ActiveMasterManager#blockUntilBecomingActiveMaster reads ZK for the master address. If it's the same port & host, it deletes the node, which triggers a recursive call to blockUntilBecomingActiveMaster. This second call succeeds (we became the active master) and returns true. This result is ignored by the first blockUntilBecomingActiveMaster: it returns false (even though we actually became the active master), hence the whole call chain returns false and HMaster#abort stops the master.
> In other words, the comment says "Ensures the master cannot recover the expired zk session since the master zk node is still there." but we're actually doing a check just for this and deleting the node. If we were not ignoring the result, we would return "true", so we would not stop the master, so the test would fail.
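
The ignored return value described above can be reproduced in isolation. What follows is only a minimal sketch under stated assumptions, not the HBase code and not the attached patch: the class ActiveMasterSketch, its masterNode field (an in-memory stand-in for the master node in ZooKeeper) and both method names are hypothetical. It contrasts a recursive shape that drops the nested result with a single-case loop that returns it.

```java
import java.util.concurrent.atomic.AtomicReference;

public class ActiveMasterSketch {
    // Hypothetical stand-in for the master node in ZooKeeper:
    // holds the "host:port" of the active master, or null if absent.
    public static final AtomicReference<String> masterNode = new AtomicReference<>();

    // Recursive shape: the nested call may succeed, but its result is dropped.
    public static boolean blockUntilActiveRecursive(String self) {
        if (masterNode.compareAndSet(null, self)) {
            return true; // we created the node, so we are the active master
        }
        if (self.equals(masterNode.get())) {
            // Stale node left by our own expired session: delete it and retry.
            masterNode.set(null);
            blockUntilActiveRecursive(self); // result ignored
        }
        return false; // reported as a failure even if the retry succeeded
    }

    // Single-case loop: the outcome of the retry is what gets returned.
    public static boolean blockUntilActiveLoop(String self) {
        while (true) {
            if (masterNode.compareAndSet(null, self)) {
                return true;
            }
            String current = masterNode.get();
            if (self.equals(current)) {
                masterNode.compareAndSet(current, null); // delete our stale node, then retry
            } else if (current != null) {
                return false; // another master holds the node (real code would wait instead)
            }
        }
    }

    public static void main(String[] args) {
        masterNode.set("host:60000"); // simulate a node surviving the expired session
        System.out.println("recursive: " + blockUntilActiveRecursive("host:60000"));
        masterNode.set("host:60000");
        System.out.println("loop: " + blockUntilActiveLoop("host:60000"));
    }
}
```

In this sketch the recursive variant returns false while masterNode already holds our own address, which mirrors the situation where the master is stopped although it did become active. The loop variant returns true for the same starting state.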