hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "nkeywal (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master
Date Tue, 13 Mar 2012 16:10:43 GMT

    [ https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228476#comment-13228476
] 

nkeywal commented on HBASE-5572:
--------------------------------

Yes. I've done 3 modifications in the code, two like for like (hopefully!) and one with a
different behavior. I:
- removed the variable named cleanSetOfActiveMaster, replaced by "return true" or "return
false". 
- replaced the recursive call by a while(true) loop. 
- implicitly (it's hidden because there is no recursive call anymore) changed the function
behavior: we now return the final result. For this reason the function behaves differently
(we return true instead of false), but it's more on line with the method contract. This change
breaks the testMasterZKSessionRecoveryFailure, because it does not fail anymore. 

TestMasterZKSessionRecovery#testMasterZKSessionRecoveryFailure was testing explicitly the
behavior with both SessionExpired AND master with same host & port. I removed it, but
I can move it to TestZooKeeper (to save a cluster start/stop) and reverse the assertion in
the test (now it does not fail).




                
> KeeperException.SessionExpiredException management could be improved in Master
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-5572
>                 URL: https://issues.apache.org/jira/browse/HBASE-5572
>             Project: HBase
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>            Priority: Minor
>         Attachments: 5572.v1.patch
>
>
> Synthesis:
>  1) TestMasterZKSessionRecovery distinguish two cases on SessionExpiredException. One
is explicitly not managed. However, is seems that there is no reason for this.
>  2) The issue lies in ActiveMasterManager#blockUntilBecomingActiveMaster, a quite complex
function, with a useless recursive call.
>  3) TestMasterZKSessionRecovery#testMasterZKSessionRecoverySuccess is equivalent to TestZooKeeper#testMasterSessionExpired
>  4) TestMasterZKSessionRecovery#testMasterZKSessionRecoveryFailure can be removed if
we merge the two cases mentioned above.
> Changes are:
>  2) Changing ActiveMasterManager#blockUntilBecomingActiveMaster to have a single case
and remove recursion
>  1) Removing TestMasterZKSessionRecovery
> Detailed justification:
> testMasterZKSessionRecoveryFailure says:
> {noformat}
>   /**
>    * Negative test of master recovery from zk session expiry.
>    *
>    * Starts with one master. Fakes the master zk session expired.
>    * Ensures the master cannot recover the expired zk session since
>    * the master zk node is still there.
>    */
>   public void testMasterZKSessionRecoveryFailure() throws Exception {
>     MiniHBaseCluster cluster = TEST_UTIL.getHBaseCluster();
>     HMaster m = cluster.getMaster();
>     m.abort("Test recovery from zk session expired",
>       new KeeperException.SessionExpiredException());
>     assertTrue(m.isStopped());
>   }
> {noformat}
> This tests works, i.e. the assertion is always verified.
> But do we really want this behavior?
> When looking at the code, we see that this what's happening is strange:
> - HMaster#abort calls Master#abortNow. If HMaster#abortNow returns false HMaster#abort
stops the master.
> - HMaster#abortNow checks the exception type. As it's a SessionExpiredException it will
try to recover, calling HMaster#tryRecoveringExpiredZKSession. If it cannot, it will return
false (and that will make HMaster#abort stopping the master)
> - HMaster#tryRecoveringExpiredZKSession recreates a ZooKeeperConnection and then try
to become the active master. If it cannot, it will return false (and that will make HMaster#abort
stopping the master).
> - HMaster#becomeActiveMaster returns the result of ActiveMasterManager#blockUntilBecomingActiveMaster.
blockUntilBecomingActiveMaster says it will return false if there is any error preventing
it to become the active master.
> - ActiveMasterManager#blockUntilBecomingActiveMaster reads ZK for the master address.
If it's the same port & host, it deletes the nodes, that will start a recursive call to
blockUntilBecomingActiveMaster. This second call succeeds (we became the active master) and
return true. This result is ignored by the first blockUntilBecomingActiveMaster: it return
false (even if we actually became the active master), hence the whole suite call returns false
and HMaster#abort stops the master.
> In other words, the comment says "Ensures the master cannot recover the expired zk session
since the master zk node is still there." but we're actually doing a check just for this and
deleting the node. If we were not ignoring the result, we would return "true", so we would
not stop the master, so the test would fail.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message