hadoop-hdfs-issues mailing list archives

From "Andrew Wang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-4851) Deadlock in pipeline recovery
Date Mon, 01 Jul 2013 18:50:22 GMT

    [ https://issues.apache.org/jira/browse/HDFS-4851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13697062#comment-13697062 ]

Andrew Wang commented on HDFS-4851:
-----------------------------------

Hey Uma, thanks for taking a look!

I may not fully understand your proposal, but I found it pretty complex to interrupt the old
writer while not holding the lock (see the patch in HDFS-3655 for the general idea).

The core issue is that more recovery threads can keep coming in. Even if we interrupt the
current old writer, by the time we re-acquire the FSD lock to call rbw.setWriter for ourselves,
some other recovery thread might have come in, and we would need to interrupt it too. Repeating
the stopWriter call requires redoing the precondition checks in the three places we call
stopWriter, each of which has different preconditions.
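
To make the retry shape concrete, here is a minimal sketch of "interrupt without the lock,
then re-check under the lock". It is an illustration only, not code from the patch or from
HDFS: the class StopWriterRetrySketch, the method takeOver, and the datasetLock and writer
fields are all hypothetical stand-ins, and the real precondition checks are reduced to a
comment.

```java
// Hypothetical sketch: none of these names exist in HDFS.
class StopWriterRetrySketch {
    // Stand-in for the dataset-wide lock.
    private final Object datasetLock = new Object();
    // Stand-in for the replica's current writer; guarded by datasetLock.
    private Thread writer;

    StopWriterRetrySketch(Thread initialWriter) {
        this.writer = initialWriter;
    }

    /**
     * Installs newWriter as the writer, stopping any live predecessor first.
     * Returns how many writers had to be stopped along the way.
     */
    int takeOver(Thread newWriter, long joinMillis) {
        int stopped = 0;
        while (true) {
            Thread current;
            synchronized (datasetLock) {
                // Real code would have to redo the caller-specific
                // precondition checks here on every pass.
                current = writer;
                if (current == null || !current.isAlive() || current == newWriter) {
                    writer = newWriter;
                    return stopped;
                }
            }
            // Lock released: interrupt and join without holding it.
            current.interrupt();
            try {
                current.join(joinMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining", e);
            }
            if (current.isAlive()) {
                throw new IllegalStateException("writer did not stop in time");
            }
            stopped++;
            // Loop again: another thread may have become the writer meanwhile.
        }
    }
}
```

The loop is what makes this awkward: every pass has to re-validate state that may have
changed while the lock was dropped.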

I would love it if there is a simpler or better solution though, so please let me know if I
missed something.
                
> Deadlock in pipeline recovery
> -----------------------------
>
>                 Key: HDFS-4851
>                 URL: https://issues.apache.org/jira/browse/HDFS-4851
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 3.0.0, 2.0.4-alpha
>            Reporter: Andrew Wang
>            Assignee: Andrew Wang
>         Attachments: hdfs-4851-1.patch
>
>
> Here's a deadlock scenario that cropped up during pipeline recovery, debugged through jstacks. Todd tipped me off to this one.
> # Pipeline fails, client initiates recovery. We have the old leftover DataXceiver, and a new one doing recovery.
> # New DataXceiver does {{recoverRbw}}, grabbing the {{FsDatasetImpl}} lock.
> # Old DataXceiver is in {{BlockReceiver#computePartialChunkCrc}}, calls {{FsDatasetImpl#getTmpInputStreams}}, and blocks on the {{FsDatasetImpl}} lock.
> # New DataXceiver calls {{ReplicaInPipeline#stopWriter}}, interrupting the old DataXceiver and then joining on it.
> # Boom, deadlock. New DX holds the {{FsDatasetImpl}} lock and is joining on the old DX, which is in turn waiting on the {{FsDatasetImpl}} lock.
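
The scenario in the issue description can be reproduced in miniature with plain Java threads.
The sketch below is an illustration under stated assumptions, not HDFS code: the class
PipelineDeadlockSketch, the method stopOldWriter, and the datasetLock object are hypothetical
stand-ins for the new DataXceiver, stopWriter, and the FsDatasetImpl lock. It uses a timed
join so the "deadlock" shows up as a timeout instead of hanging. It relies on the fact that
interrupting a thread does not wake it while it is blocked entering a synchronized block.

```java
// Hypothetical sketch: none of these names exist in HDFS.
class PipelineDeadlockSketch {

    /**
     * Models the new DataXceiver stopping the old one.
     * Returns true if the old writer exited within joinMillis (must be > 0).
     */
    static boolean stopOldWriter(boolean holdLockWhileJoining, long joinMillis) {
        // Stand-in for the FsDatasetImpl lock.
        final Object datasetLock = new Object();

        // Stand-in for the old DataXceiver: it needs the lock before it can finish.
        Thread oldWriter = new Thread(() -> {
            synchronized (datasetLock) {
                // Work that needs the lock would happen here.
            }
        }, "old-DataXceiver");
        oldWriter.setDaemon(true);

        try {
            synchronized (datasetLock) {
                // Started while we hold the lock, so it blocks on monitor entry.
                oldWriter.start();
                if (holdLockWhileJoining) {
                    // The interrupt cannot unblock a thread waiting for a monitor,
                    // so this join can only end by timing out.
                    oldWriter.interrupt();
                    oldWriter.join(joinMillis);
                    return !oldWriter.isAlive();
                }
            }
            // Lock released first: the old writer can acquire it and exit.
            oldWriter.interrupt();
            oldWriter.join(joinMillis);
            return !oldWriter.isAlive();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while joining", e);
        }
    }
}
```

Calling stopOldWriter(true, 200) returns false because the join times out while the lock is
held, while stopOldWriter(false, 5000) returns true once the lock has been released before
joining. With an untimed join, the first case would never return.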

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
