hadoop-common-issues mailing list archives

From "Bikas Saha (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-8217) Edge case split-brain race in ZK-based auto-failover
Date Sat, 31 Mar 2012 01:01:46 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-8217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242908#comment-13242908 ]

Bikas Saha commented on HADOOP-8217:

Ah. The confusion was caused by 
bq. 4(failure case): if NN1.transitionToStandby() times out or fails, the non-graceful fencing
is initiated (same as in existing HA code for the last several months)
I read that as saying non-graceful fencing had existed in the HA code for several months,
whereas you were referring to the fencing methods.

I think the piece that was missing from the solution was 
bq. 4(failure case): if NN1.transitionToStandby() times out or fails, the non-graceful fencing
is initiated

I think this is what confused me (and perhaps Hari too) into thinking that NN1 would behave
badly. On HDFS-2185 I have commented that the ZKFC state diagram is missing the arcs for
transitionToActive/Standby() failing. It looks like ZKFC does take specific action there; it is
just missing from the transition diagram posted on that jira.
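
To make the failure arc from step 4 concrete, here is a minimal Python sketch of "try the graceful
transition, fall back to non-graceful fencing". It is a toy model, not Hadoop code: make_standby,
healthy_rpc, hung_rpc and fence_nn1 are hypothetical names, and the callables only stand in for
the RPC to NN1 and for a configured fencing method.

```python
def make_standby(transition_to_standby, fence):
    """Try the graceful path first; fall back to non-graceful fencing.

    Both arguments are hypothetical callables: one stands in for the
    transitionToStandby() RPC to the old active, the other for a fencing method.
    Returns the name of the path that was taken.
    """
    try:
        transition_to_standby()
        return "graceful"
    except (TimeoutError, ConnectionError):
        # The RPC timed out or failed, so the old active's state is unknown.
        fence()
        return "fenced"


log = []


def healthy_rpc():
    log.append("NN1 transitionToStandby ok")


def hung_rpc():
    raise TimeoutError("NN1 transitionToStandby timed out")


def fence_nn1():
    log.append("NN1 fenced")


first = make_standby(healthy_rpc, fence_nn1)   # graceful path
second = make_standby(hung_rpc, fence_nn1)     # failure arc: fencing
```

The point of the sketch is only that the failure arc exists as an explicit branch, which is the
arc missing from the posted diagram.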

In this case, the problem is happening because FC2 is calling NN1.transitionToStandby() and
then FC1 is calling NN1.transitionToActive(). 
I would like to question the value of FC2 calling NN1.transitionToStandby() in general. FC1
on NN1 is supposed to call NN1.transitionToStandby() because that is FC1's responsibility
upon losing the leader lock.
Secondly, based on the recent work done to add breadcrumbs to the ActiveStandbyElector, FC2
is going to fence NN1 if NN1 has not gracefully given up the lock, which is clearly the case
here. So the problem is already solved unless I am mistaken.
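
A rough sketch of the breadcrumb argument, as a toy Python model rather than the actual
ActiveStandbyElector: ToyZK, become_active, quit_gracefully and session_expired are hypothetical
names, and the model assumes the breadcrumb is written when the lock is taken and removed only on
a graceful exit.

```python
class ToyZK:
    """Toy stand-in for the two pieces of ZooKeeper state discussed above."""

    def __init__(self):
        self.lock_holder = None  # models the ephemeral leader lock
        self.breadcrumb = None   # models the persistent breadcrumb


def become_active(zk, me, fence):
    """Take the lock; fence the previous holder if it left a breadcrumb behind."""
    if zk.lock_holder is not None:
        raise RuntimeError("lock is held by " + zk.lock_holder)
    zk.lock_holder = me
    old = zk.breadcrumb
    fenced = None
    if old is not None and old != me:
        fence(old)           # the previous active did not exit gracefully
        fenced = old
    zk.breadcrumb = me       # assumption: written before going active
    return fenced


def quit_gracefully(zk, me):
    """Graceful exit: remove the breadcrumb, then release the lock."""
    if zk.lock_holder == me:
        zk.breadcrumb = None
        zk.lock_holder = None


def session_expired(zk, me):
    """Session loss: the ephemeral lock vanishes, the breadcrumb stays."""
    if zk.lock_holder == me:
        zk.lock_holder = None


# Case from this issue: FC1 freezes and loses its session without cleaning up.
fenced_log = []
zk = ToyZK()
become_active(zk, "NN1", fenced_log.append)
session_expired(zk, "NN1")
victim = become_active(zk, "NN2", fenced_log.append)

# Contrast: a graceful handover leaves no breadcrumb, so nothing is fenced.
zk_clean = ToyZK()
become_active(zk_clean, "NN1", fenced_log.append)
quit_gracefully(zk_clean, "NN1")
clean_victim = become_active(zk_clean, "NN2", fenced_log.append)
```

Under those assumptions, FC2 fences NN1 in exactly the situation described here, without needing
to call NN1.transitionToStandby() itself.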

> Edge case split-brain race in ZK-based auto-failover
> ----------------------------------------------------
>                 Key: HADOOP-8217
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8217
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: auto-failover, ha
>    Affects Versions: 0.24.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>         Attachments: hadoop-8217-testcase.txt
> As discussed in HADOOP-8206, the current design for automatic failover has the following race:
> - ZKFC1 gets active lock
> - ZKFC1 is about to send transitionToActive() and machine freezes (eg GC pause + swapping)
> - ZKFC1 loses its ZK lock, ZKFC2 gets ZK lock
> - ZKFC2 calls transitionToStandby on NN1, and transitions NN2 to active
> - ZKFC1 wakes up from pause, calls transitionToActive(), now we have a bad situation
> This is rare, since it requires ZKFC1 to freeze longer than its ZK session timeout, but
> worth fixing, since the results can be disastrous.
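
The sequence in the quoted description can be replayed with a toy Python model. ToyNameNode is a
hypothetical class that only tracks an active/standby flag; it applies the RPCs in the order
listed and shows why the late transitionToActive() is dangerous.

```python
class ToyNameNode:
    """Hypothetical NameNode that only remembers its HA state."""

    def __init__(self, name):
        self.name = name
        self.state = "standby"

    def transition_to_active(self):
        self.state = "active"

    def transition_to_standby(self):
        self.state = "standby"


nn1 = ToyNameNode("NN1")
nn2 = ToyNameNode("NN2")

# ZKFC1 has the lock but freezes before sending its RPC, then loses the lock.
# ZKFC2 gets the lock and acts:
nn1.transition_to_standby()
nn2.transition_to_active()

# ZKFC1 wakes up and sends the RPC it had queued before the pause:
nn1.transition_to_active()

active = [nn.name for nn in (nn1, nn2) if nn.state == "active"]
```

Both toy NameNodes end up active, which is the split-brain outcome the issue describes.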


