hbase-issues mailing list archives

From "Esteban Gutierrez (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-17305) Two active HBase Masters can run at the same time under certain circumstances
Date Thu, 15 Dec 2016 20:21:59 GMT

    [ https://issues.apache.org/jira/browse/HBASE-17305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15752410#comment-15752410 ]

Esteban Gutierrez commented on HBASE-17305:
-------------------------------------------

It was a regular restart, [~enis]. I'm sure this is very rare. Here is what I think the
culprit is:

{code}
blockUntilBecomingActiveMaster() {
...
        this.clusterHasActiveMaster.set(true);
...
byte[] bytes = ZKUtil.getDataAndWatch(this.watcher, this.watcher.znodePaths.masterAddressZNode); // <--- [0]
...
currentMaster = ProtobufUtil.parseServerNameFrom(bytes);
...
if (ServerName.isSameHostnameAndPort(currentMaster, this.sn)) { 
            msg = ("Current master has this master's address, " +
              currentMaster + "; master was restarted? Deleting node.");
            // Hurry along the expiration of the znode.
            ZKUtil.deleteNode(this.watcher, this.watcher.znodePaths.masterAddressZNode); // <--- [1]

            // We may have failed to delete the znode at the previous step, but
            //  we delete the file anyway: a second attempt to delete the znode is likely
            //  to fail again.
            ZNodeClearer.deleteMyEphemeralNodeOnDisk();
          } else {
...
{code}

I think the problem lies between [0] and [1]: the old master reads the znode at [0] and
concludes there was a restart, and before it deletes the znode at [1] a backup master
becomes active. As I mentioned, this happened in a very short time, somewhere around 85ms,
but it could be less due to clock jitter.
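
A minimal, deterministic sketch of that suspected interleaving, for illustration only. It does not use ZooKeeper or HBase: `FakeZNode`, `sameHostAndPort`, the host names and the start codes are all hypothetical stand-ins, watches are not modeled, and the step where the stale znode disappears between [0] and [1] is an assumption about how the backup master gets its chance.

```java
import java.util.ArrayList;
import java.util.List;

/** Deterministic replay of the suspected interleaving; no real ZooKeeper involved. */
final class MasterRaceSketch {

    /** Hypothetical in-memory stand-in for the master address znode. */
    static final class FakeZNode {
        private String data; // null means the znode does not exist

        FakeZNode(String data) { this.data = data; }

        synchronized String getData() { return data; }

        /** Mimics creating the znode: succeeds only if it is absent. */
        synchronized boolean createIfAbsent(String owner) {
            if (data != null) {
                return false;
            }
            data = owner;
            return true;
        }

        /** Unconditional delete, like the call at [1]. */
        synchronized void delete() { data = null; }
    }

    /** Compares "host,port" and ignores the trailing start code. */
    static boolean sameHostAndPort(String a, String b) {
        return a.substring(0, a.lastIndexOf(','))
                .equals(b.substring(0, b.lastIndexOf(',')));
    }

    /** Returns the masters that believe they are active after the replay. */
    static List<String> simulate() {
        String stale = "m1.example.org,16000,1000";      // previous incarnation of A
        String restartedA = "m1.example.org,16000,2000"; // A after the restart
        String backupB = "m2.example.org,16000,1500";
        FakeZNode znode = new FakeZNode(stale);
        List<String> active = new ArrayList<>();

        // [0] Restarted A reads the znode and sees its own host and port.
        String seen = znode.getData();
        boolean deleteDecided = sameHostAndPort(seen, restartedA);

        // Assumed window between [0] and [1]: the stale node goes away
        // and backup B creates its own znode, becoming active.
        znode.delete();
        if (znode.createIfAbsent(backupB)) {
            active.add(backupB);
        }

        // [1] A deletes whatever is there now, which is B's znode.
        if (deleteDecided) {
            znode.delete();
        }
        // A retries and creates the znode for itself.
        if (znode.createIfAbsent(restartedA)) {
            active.add(restartedA);
        }
        return active;
    }

    public static void main(String[] args) {
        System.out.println("Active masters: " + simulate());
    }
}
```

In this replay both masters end up in the returned list, because the delete at [1] acts on a decision made at [0] without re-checking what the znode holds.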

One solution might be to update the znode instead of deleting it when there is a restart
of the active master.
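
One possible reading of that idea, sketched under assumptions: the update is made conditional on the znode being unchanged since it was read at [0], so it fails if a backup master took over in between. This is not a patch and not the HBase or ZooKeeper API. `FakeZNode`, its `generation` counter, `updateIfUnchanged` and `claim` are hypothetical, the counter stands in for whatever token a real implementation would compare, and ephemeral-node ownership is not modeled.

```java
/** Sketch of a conditional update replacing the unconditional delete at [1]. */
final class ZNodeUpdateSketch {

    /** Hypothetical in-memory znode with a counter bumped on every change. */
    static final class FakeZNode {
        private String data;
        private long generation;

        FakeZNode(String data) { this.data = data; }

        synchronized String getData() { return data; }

        synchronized long getGeneration() { return generation; }

        /** Models another master replacing the node (delete followed by create). */
        synchronized void replaceWith(String owner) {
            data = owner;
            generation++;
        }

        /** Updates only if the node is unchanged since expectedGeneration was read. */
        synchronized boolean updateIfUnchanged(String newData, long expectedGeneration) {
            if (generation != expectedGeneration) {
                return false;
            }
            data = newData;
            generation++;
            return true;
        }
    }

    /** Compares "host,port" and ignores the trailing start code. */
    static boolean sameHostAndPort(String a, String b) {
        return a.substring(0, a.lastIndexOf(','))
                .equals(b.substring(0, b.lastIndexOf(',')));
    }

    /**
     * The restarted master tries to take over its stale znode. If backupInWindow
     * is non-null, that backup becomes active between [0] and [1].
     */
    static boolean claim(FakeZNode znode, String me, String backupInWindow) {
        long generation = znode.getGeneration(); // [0] remember what was read
        String current = znode.getData();
        if (current == null || !sameHostAndPort(current, me)) {
            return false;
        }
        if (backupInWindow != null) {
            znode.replaceWith(backupInWindow); // the race window
        }
        // [1] update instead of delete; fails if the node changed since [0]
        return znode.updateIfUnchanged(me, generation);
    }

    public static void main(String[] args) {
        FakeZNode raced = new FakeZNode("m1.example.org,16000,1000");
        boolean won = claim(raced, "m1.example.org,16000,2000", "m2.example.org,16000,1500");
        System.out.println("restarted master active: " + won + ", znode: " + raced.getData());
    }
}
```

With no interleaving the restarted master takes over the znode; when the backup wins inside the window, the update fails and the backup's znode is left intact, so only one master is active in either case.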


> Two active HBase Masters can run at the same time under certain circumstances 
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-17305
>                 URL: https://issues.apache.org/jira/browse/HBASE-17305
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 2.0.0
>            Reporter: Esteban Gutierrez
>            Assignee: Esteban Gutierrez
>            Priority: Critical
>
> This needs a little more investigation, but we found an edge case when the active master
> is restarted and a standby master tries to become active. The original active master was
> able to become the active master again just before the standby master passed the point of
> the transition to become active, and we ended up with two active masters running at the
> same time. Assuming the clocks on both masters were accurate to milliseconds, this race
> happened in less than 85ms.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
