hadoop-hdfs-issues mailing list archives

From "Uma Maheswara Rao G (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-4851) Deadlock in pipeline recovery
Date Sat, 29 Jun 2013 04:12:20 GMT

    [ https://issues.apache.org/jira/browse/HDFS-4851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696044#comment-13696044 ]

Uma Maheswara Rao G commented on HDFS-4851:

Thanks for working on this, Andrew.
This can happen wherever we call stopWriter while holding the fsdataset lock and the old
writer needs to acquire that same lock at that moment, right?
Since this only occurs on recovery calls, your patch looks like a simple approach: fail the
current recovery if the older DX cannot proceed because the current thread already holds the
fsdataset lock.
Another option: how about moving the stopWriter call into a separate method, where we just get
the rbw under the lock and then interrupt the old writer without holding the lock? Only after
that step would we call recoverRbw.
(recoverRbw would then no longer need to stop the old writer, since that logic moves to the
separate call.)
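
To make the alternative above concrete, here is a minimal toy sketch in plain Java. It is not HDFS code: the class StopWriterOutsideLock, the datasetLock object, and the getWriter/stopWriter/recoverRbw methods are hypothetical stand-ins for the FsDatasetImpl monitor, the rbw lookup, and the recovery call. It only illustrates the ordering: read the writer reference under the lock, interrupt and join without the lock, then run recovery under the lock.

```java
import java.util.concurrent.CountDownLatch;

/** Toy model only; every name here is a hypothetical stand-in, not HDFS code. */
public class StopWriterOutsideLock {
    private final Object datasetLock = new Object(); // stands in for the FsDatasetImpl monitor
    private Thread writer;                           // stands in for the old DataXceiver thread

    /** Step 1: hold the lock only long enough to read the writer reference. */
    Thread getWriter() {
        synchronized (datasetLock) {
            return writer;
        }
    }

    /** Step 2: interrupt and join the old writer WITHOUT holding datasetLock. */
    boolean stopWriter(long timeoutMs) {
        Thread t = getWriter();
        if (t == null || t == Thread.currentThread()) {
            return true;
        }
        t.interrupt();
        try {
            t.join(timeoutMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !t.isAlive();
    }

    /** Step 3: recovery proper runs under the lock and no longer stops the writer itself. */
    boolean recoverRbw(long timeoutMs) {
        if (!stopWriter(timeoutMs)) {
            return false;
        }
        synchronized (datasetLock) {
            writer = Thread.currentThread();
            return true;
        }
    }

    /** Runs an old writer that keeps taking datasetLock, then recovers; true if recovery completes. */
    public static boolean demo() {
        final StopWriterOutsideLock ds = new StopWriterOutsideLock();
        final CountDownLatch started = new CountDownLatch(1);
        Thread old = new Thread(() -> {
            started.countDown();
            try {
                while (true) {
                    synchronized (ds.datasetLock) {
                        // simulated dataset call that needs the lock
                    }
                    Thread.sleep(1);
                }
            } catch (InterruptedException e) {
                // interrupted by stopWriter: exit the writer loop
            }
        });
        synchronized (ds.datasetLock) {
            ds.writer = old;
        }
        old.start();
        try {
            started.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return ds.recoverRbw(5000);
    }

    public static void main(String[] args) {
        System.out.println("recovery completed: " + demo());
    }
}
```

One caveat of this sketch: between stopWriter returning and recoverRbw taking the lock, another thread could in principle register itself as the writer, so a real change would presumably need to re-check the writer once the lock is held.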
> Deadlock in pipeline recovery
> -----------------------------
>                 Key: HDFS-4851
>                 URL: https://issues.apache.org/jira/browse/HDFS-4851
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 3.0.0, 2.0.4-alpha
>            Reporter: Andrew Wang
>            Assignee: Andrew Wang
>         Attachments: hdfs-4851-1.patch
> Here's a deadlock scenario that cropped up during pipeline recovery, debugged through jstacks. Todd tipped me off to this one.
> # Pipeline fails, client initiates recovery. We have the old leftover DataXceiver, and a new one doing recovery.
> # New DataXceiver does {{recoverRbw}}, grabbing the {{FsDatasetImpl}} lock.
> # Old DataXceiver is in {{BlockReceiver#computePartialChunkCrc}}, calls {{FsDatasetImpl#getTmpInputStreams}} and blocks on the {{FsDatasetImpl}} lock.
> # New DataXceiver calls {{ReplicaInPipeline#stopWriter}}, interrupting the old DataXceiver and then joining on it.
> # Boom, deadlock. New DX holds the {{FsDatasetImpl}} lock and is joining on the old DX, which is in turn waiting on the {{FsDatasetImpl}} lock.
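
The lock ordering in the steps above can be reproduced in a few lines of plain Java. The sketch below is a toy model, not HDFS code: PipelineRecoveryDeadlock, datasetLock, and joinUnderLockStalls are hypothetical names. It uses a timed join so the program terminates; with an untimed join, as in the scenario described, the "new" thread would wait forever. It relies on the fact that a thread blocked on entering a synchronized block does not respond to interrupt().

```java
/** Toy reproduction of the lock ordering; all names are hypothetical, not HDFS code. */
public class PipelineRecoveryDeadlock {
    private static final Object datasetLock = new Object(); // stands in for the FsDatasetImpl monitor

    /**
     * Plays the role of the new DataXceiver: holds datasetLock, interrupts the old
     * thread, and joins on it with a timeout. Returns true if the old thread was
     * still alive when the timed join gave up, i.e. an untimed join would hang.
     */
    public static boolean joinUnderLockStalls(long timeoutMs) {
        // Plays the role of the old DataXceiver: it needs datasetLock to make progress.
        Thread old = new Thread(() -> {
            synchronized (datasetLock) {
                // simulated dataset call made by the old writer
            }
        });
        boolean stalled;
        try {
            synchronized (datasetLock) {
                old.start();
                // Wait until the old thread is really blocked on the monitor.
                while (old.getState() != Thread.State.BLOCKED) {
                    Thread.sleep(1);
                }
                old.interrupt();      // has no effect on a thread waiting to enter a monitor
                old.join(timeoutMs);  // timed here only so that the demo terminates
                stalled = old.isAlive();
            }
            old.join();               // once the lock is released, the old thread can finish
            return stalled;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("join under lock stalled: " + joinUnderLockStalls(200));
    }
}
```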

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
