hadoop-common-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Bikas Saha (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-8217) Edge case split-brain race in ZK-based auto-failover
Date Fri, 30 Mar 2012 23:11:25 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-8217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242838#comment-13242838
] 

Bikas Saha commented on HADOOP-8217:
------------------------------------

bq.ZKFC1 is about to send transitionToActive() and machine freezes (eg GC pause + swapping)
bq.This would solve the race as described, since when ZKFC2 calls NN1.transitionToStandby(),
it hands NN1 a higher zxid than ZKFC1 saw. So when ZKFC1 then asks it to go active, the request
is denied.

Given, the above, how will NN1 receive the zxid from ZKFC2? If it does not then the solution
is invalid. Hari's scenario exemplifies this.

                
> Edge case split-brain race in ZK-based auto-failover
> ----------------------------------------------------
>
>                 Key: HADOOP-8217
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8217
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: auto-failover, ha
>    Affects Versions: 0.24.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>         Attachments: hadoop-8217-testcase.txt
>
>
> As discussed in HADOOP-8206, the current design for automatic failover has the following
race:
> - ZKFC1 gets active lock
> - ZKFC1 is about to send transitionToActive() and machine freezes (eg GC pause + swapping)
> - ZKFC1 loses its ZK lock, ZKFC2 gets ZK lock
> - ZKFC2 calls transitionToStandby on NN1, and transitions NN2 to active
> - ZKFC1 wakes up from pause, calls transitionToActive(), now we have a bad situation
> This is rare, since it requires ZKFC1 to freeze longer than its ZK session timeout, but
worth fixing, since the results can be disastrous.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message