hadoop-hdfs-issues mailing list archives

From "Chris Nauroth (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-5014) BPOfferService#processCommandFromActor() synchronization on namenode RPC call delays IBR to Active NN, if Standby NN is unstable
Date Sat, 24 Aug 2013 19:38:51 GMT

    [ https://issues.apache.org/jira/browse/HDFS-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13749447#comment-13749447
] 

Chris Nauroth commented on HDFS-5014:
-------------------------------------

I'm going to be mostly offline until 9/2.  I'm not +1 for the current patch, but I would be
+1 if it were changed to hold the write lock for all of {{updateActorStatesFromHeartbeat}}.
I'd like to seek out a second opinion though, because concurrency is hard.  :-)  [~tlipcon],
maybe you could take a look since you originally coded HDFS-2627?

Vinay, thank you for reporting the bug and working through the feedback on the patch.

bq. I didn't understand how we could get 'some unfortunate interleavings'. Can you please explain?
I thought since we are reading under lock, we would always get the correct state.

The challenge is that we are updating multiple pieces of data ({{bpServiceToActive}}, {{bposThinksActive}},
{{lastActiveClaimTxId}}, and {{isMoreRecentClaim}}), and the failover handling logic isn't
correct unless those pieces of data are consistent with one another.  Without mutual exclusion,
we can end up with {{bposThinksActive}} true even though {{bpServiceToActive}} has changed,
or we could end up with {{isMoreRecentClaim}} true, even though a more recent transaction
ID has been seen.
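
To make the hazard concrete, here is a much-simplified, self-contained sketch of that
check-then-act shape (illustrative names only, not the actual {{BPOfferService}} code or the
current patch): the snapshot is taken under the read lock, but the decision and the updates
happen after the lock is released.

{code}
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Much-simplified sketch, not the actual HDFS code: the snapshot of the HA
// state is taken under the read lock, but it is acted on after the lock is
// released, so the values can go stale in between.
class FailoverSketch {
  enum HAState { ACTIVE, STANDBY }

  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private Object bpServiceToActive;   // which actor we believe talks to the active NN
  private long lastActiveClaimTxId;   // highest txid seen in an active claim

  void updateActorStatesFromHeartbeat(Object actor, HAState reported, long txid) {
    final boolean nnClaimsActive;
    final boolean bposThinksActive;
    final boolean isMoreRecentClaim;
    lock.readLock().lock();
    try {
      nnClaimsActive = (reported == HAState.ACTIVE);
      bposThinksActive = (bpServiceToActive == actor);
      isMoreRecentClaim = (txid > lastActiveClaimTxId);
    } finally {
      lock.readLock().unlock();
    }
    // By the time we act on the snapshot, the other actor thread may already
    // have changed bpServiceToActive or lastActiveClaimTxId.
    if (nnClaimsActive && !bposThinksActive && isMoreRecentClaim) {
      bpServiceToActive = actor;      // Example 2 below: may override a newer claim
      lastActiveClaimTxId = txid;
    } else if (!nnClaimsActive && bposThinksActive) {
      bpServiceToActive = null;       // Example 1 below: may discard the real active
    }
  }
}
{code}

Note that taking the write lock around each individual assignment would not close the gap; the
whole check-then-act sequence needs to be atomic, which is exactly what holding the write lock
for all of {{updateActorStatesFromHeartbeat}} gives us.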

In the following examples, assume T1 and T2 are separate threads for separate {{BPServiceActor}}
instances/separate namenodes.

Example 1: race on {{bposThinksActive}}

# T1 acquires read lock.
# T1 calculates {{bposThinksActive}} true.
# T1 releases read lock.
# The OS suspends T1 here.  (Note that T1 is holding neither read lock nor write lock at this
point.)
# T2 wakes up, processes a heartbeat, and executes all of {{updateActorStatesFromHeartbeat}}.
It determines that its own namenode is the new active, so it updates {{bpServiceToActive}}.
# The OS suspends T2 here.
# T1 resumes.  It still thinks {{bposThinksActive}} is true, even though the other thread
has changed {{bpServiceToActive}}.
# T1 can now fall into the block for {{else if (!nnClaimsActive && bposThinksActive)}}
and set {{bpServiceToActive}} to null.
# The OS suspends T1 and resumes T2.
# T2 now erroneously skips commands from the active NN, because {{bpServiceToActive}} is null.

Example 2: race on {{isMoreRecentClaim}}

# T1 acquires read lock.
# T1 calculates {{isMoreRecentClaim}} based on {{lastActiveClaimTxId}} 1 and {{txid}} 2 from
the heartbeat.
# T1 releases read lock.
# The OS suspends T1 here.  (Note that T1 is holding neither read lock nor write lock at this
point.)
# T2 wakes up, processes a heartbeat, and executes all of {{updateActorStatesFromHeartbeat}}.
 It updates {{lastActiveClaimTxId}} to 3.
# The OS suspends T2 here.
# T1 resumes.  It still thinks {{isMoreRecentClaim}} is true, even though the other thread
has changed {{lastActiveClaimTxId}}.
# T1 can now set {{bpServiceToActive}} to itself instead of going into the split-brain warning
path.
# The OS suspends T1 and resumes T2.
# T2 now erroneously skips commands from the active NN, because {{bpServiceToActive}} was
changed.  (A stand-alone sketch of this kind of interleaving follows below.)
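
To show this is not just theoretical, here is a small stand-alone program (again, not HDFS code;
names are only illustrative) that forces the interleaving from Example 1 deterministically with
latches: T1 snapshots the shared state under the read lock, T2 changes it under the write lock in
the gap, and T1 then acts on the stale snapshot.

{code}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Stand-alone illustration of the interleaving in Example 1, not HDFS code.
// Latches stand in for "the OS suspends T1/T2 here" and make the race
// deterministic.
public class StaleSnapshotDemo {
  private static final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private static String activeActor = "T1";                    // think bpServiceToActive
  private static final CountDownLatch t1Snapshotted = new CountDownLatch(1);
  private static final CountDownLatch t2Updated = new CountDownLatch(1);

  public static void main(String[] args) throws InterruptedException {
    Thread t1 = new Thread(() -> {
      boolean thinksActive;
      lock.readLock().lock();
      try {
        thinksActive = "T1".equals(activeActor);               // snapshot under read lock
      } finally {
        lock.readLock().unlock();
      }
      t1Snapshotted.countDown();                               // "OS suspends T1 here"
      await(t2Updated);                                        // T2 runs in the gap
      if (thinksActive) {                                      // stale check-then-act
        lock.writeLock().lock();
        try {
          activeActor = null;                                  // clobbers T2's update
        } finally {
          lock.writeLock().unlock();
        }
      }
    });

    Thread t2 = new Thread(() -> {
      await(t1Snapshotted);
      lock.writeLock().lock();
      try {
        activeActor = "T2";                                    // T2's NN claims active
      } finally {
        lock.writeLock().unlock();
      }
      t2Updated.countDown();
    });

    t1.start();
    t2.start();
    t1.join();
    t2.join();
    System.out.println("activeActor = " + activeActor);        // prints null: T2's claim was lost
  }

  private static void await(CountDownLatch latch) {
    try {
      latch.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
{code}

Example 2 is the same pattern with {{lastActiveClaimTxId}}: the read-locked snapshot of
{{isMoreRecentClaim}} goes stale before it is acted on.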

bq. Actually if we try holding writeLock() throughout the method, then that will be the same as
adding method-level synchronization, where we would get the same issue as this jira.

My understanding is that you had an actor get stuck in the re-registration loop when it got
a {{DNA_REGISTER}} request from a flapping namenode.  Since {{processCommandFromActor}} was
synchronized, this also blocked the actor thread for the healthy namenode from making progress.
 Therefore, we can solve the problem by allowing concurrent executions of {{processCommandFromActor}},
so that the healthy actor can make progress even if the unhealthy actor gets stuck.  However,
we don't need to allow concurrent execution of the heartbeat/HA failover processing (and I
think it would be incorrect to do so based on the examples above).

Unlike {{processCommandFromActor}}, the {{updateActorStatesFromHeartbeat}} method only changes
local in-memory state.  There is no polling loop or RPC involved, so there is no risk that
one actor gets stuck inside this method and blocks the other.  I expect minimal lock contention
on the write lock.
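
In terms of the simplified sketch earlier in this comment, the suggestion is just to widen the
critical section so the check and the update are one atomic step (again illustrative only, not
the actual patch):

{code}
// Same simplified FailoverSketch as above, with the whole check-then-act
// sequence under the write lock; commands from actors are still processed
// concurrently, outside this lock.  Illustrative only, not the actual patch.
void updateActorStatesFromHeartbeat(Object actor, HAState reported, long txid) {
  lock.writeLock().lock();
  try {
    final boolean nnClaimsActive = (reported == HAState.ACTIVE);
    final boolean bposThinksActive = (bpServiceToActive == actor);
    final boolean isMoreRecentClaim = (txid > lastActiveClaimTxId);
    if (nnClaimsActive && !bposThinksActive && isMoreRecentClaim) {
      bpServiceToActive = actor;
      lastActiveClaimTxId = txid;
    } else if (!nnClaimsActive && bposThinksActive) {
      bpServiceToActive = null;
    }
  } finally {
    lock.writeLock().unlock();
  }
}
{code}

Since the body is a handful of in-memory comparisons and assignments, the write lock is held only
for a few instructions per heartbeat, so the two actor threads should almost never contend on it.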

                
> BPOfferService#processCommandFromActor() synchronization on namenode RPC call delays
IBR to Active NN, if Standby NN is unstable
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-5014
>                 URL: https://issues.apache.org/jira/browse/HDFS-5014
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, ha
>    Affects Versions: 3.0.0, 2.0.4-alpha
>            Reporter: Vinay
>            Assignee: Vinay
>         Attachments: HDFS-5014.patch, HDFS-5014.patch, HDFS-5014.patch, HDFS-5014.patch
>
>
> In one of our clusters, the following happened, which caused HDFS writes to fail.
> 1. Standby NN was unstable and continuously restarting due to some errors. But the Active
NN was stable.
> 2. MR Job was writing files.
> 3. At some point the SNN went down again while the datanode was processing the REGISTER command for
the SNN.
> 4. Datanodes started retrying to connect to the SNN to register, at the following code in
BPServiceActor#retrieveNamespaceInfo(), which is called under synchronization.
> {code}      try {
>         nsInfo = bpNamenode.versionRequest();
>         LOG.debug(this + " received versionRequest response: " + nsInfo);
>         break;{code}
> Unfortunately this happened in all datanodes at the same point.
> 5. For the next 7-8 minutes the standby was down, no blocks were reported to the active NN during
this period, and writes failed.
> So the culprit is that {{BPOfferService#processCommandFromActor()}} is completely synchronized,
which is not required.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
