hadoop-hdfs-issues mailing list archives

From "Andrew Wang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-4851) Deadlock in pipeline recovery
Date Wed, 26 Jun 2013 21:13:20 GMT

     [ https://issues.apache.org/jira/browse/HDFS-4851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Wang updated HDFS-4851:

    Attachment: hdfs-4851-1.patch

I realized that HDFS-3655 is actually addressing the same issue and tried to revive that approach,
but it ended up being super complicated to check and recheck the preconditions on lock acquisition.

Attached here instead is a simpler strategy: abort recovery if we end up waiting too long
on this lock. While not optimal, it should be safe, since the client can retry recovery.
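
To make the idea concrete, here is a minimal sketch of a bounded wait, not the attached patch itself. It assumes the wait in question is the join on the old writer thread; the class name {{BoundedStopWriter}}, the {{timeoutMs}} parameter, and the {{tryStopWriter}} wrapper are hypothetical names used only for illustration.

```java
import java.io.IOException;

// Hypothetical helper, not code from the HDFS tree.
final class BoundedStopWriter {

    /**
     * Interrupts the writer thread and waits at most timeoutMs for it to exit.
     * Throws IOException if the thread is still alive afterwards, so the
     * caller can abort recovery instead of blocking forever.
     */
    static void stopWriter(Thread writer, long timeoutMs) throws IOException {
        if (timeoutMs <= 0) {
            // Thread.join(0) means "wait forever", which is what we want to avoid.
            throw new IllegalArgumentException("timeoutMs must be positive");
        }
        if (writer == null || writer == Thread.currentThread() || !writer.isAlive()) {
            return; // nothing to stop
        }
        writer.interrupt();
        try {
            writer.join(timeoutMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            throw new IOException("Interrupted while waiting for " + writer.getName(), e);
        }
        if (writer.isAlive()) {
            throw new IOException("Join on writer thread " + writer.getName() + " timed out");
        }
    }

    /** Convenience wrapper for callers that only need success or failure. */
    static boolean tryStopWriter(Thread writer, long timeoutMs) {
        try {
            stopWriter(writer, timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

If the exception propagates out of the code that holds the dataset lock, the lock is released on the way out, which is what lets the stuck writer make progress and the client retry.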

Since this is very hard to unit test, I tested by adding a loop that grabs the lock repeatedly
in {{receivePacket}}, verified that this caused the deadlock, and then applied the patch and
verified that the error message was printed.

HDFS-3655 can be where we properly fix this issue or, more broadly, re-examine finer-grained
locking during recovery.
> Deadlock in pipeline recovery
> -----------------------------
>                 Key: HDFS-4851
>                 URL: https://issues.apache.org/jira/browse/HDFS-4851
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 3.0.0, 2.0.4-alpha
>            Reporter: Andrew Wang
>            Assignee: Andrew Wang
>         Attachments: hdfs-4851-1.patch
> Here's a deadlock scenario that cropped up during pipeline recovery, debugged through jstacks. Todd tipped me off to this one.
> # Pipeline fails, client initiates recovery. We have the old leftover DataXceiver, and a new one doing recovery.
> # New DataXceiver does {{recoverRbw}}, grabbing the {{FsDatasetImpl}} lock.
> # Old DataXceiver is in {{BlockReceiver#computePartialChunkCrc}}, calls {{FsDatasetImpl#getTmpInputStreams}} and blocks on the {{FsDatasetImpl}} lock.
> # New DataXceiver calls {{ReplicaInPipeline#stopWriter}}, interrupting the old DataXceiver and then joining on it.
> # Boom, deadlock. New DX holds the {{FsDatasetImpl}} lock and is joining on the old DX, which is in turn waiting on the {{FsDatasetImpl}} lock.
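
The steps above can be reproduced in miniature. The sketch below is a standalone illustration, not HDFS code: it assumes the {{FsDatasetImpl}} lock behaves like a plain intrinsic monitor, and {{PipelineRecoveryDeadlockDemo}}, {{datasetLock}}, and {{run}} are hypothetical names. The join is given a timeout only so that the demo terminates; with an unbounded join the "new" thread would wait forever.

```java
import java.util.Arrays;

// Hypothetical standalone demo, not code from the HDFS tree.
final class PipelineRecoveryDeadlockDemo {
    // Stand-in for the FsDatasetImpl monitor (assumption: an intrinsic lock).
    private static final Object datasetLock = new Object();

    /**
     * Plays the "new DataXceiver" role on the calling thread.
     * Returns {oldBlockedAfterInterrupt, oldAliveAfterTimedJoin, oldExitedAfterRelease}.
     */
    static boolean[] run(long joinTimeoutMs) {
        Thread oldXceiver;
        boolean blockedAfterInterrupt;
        boolean aliveAfterJoin;
        synchronized (datasetLock) {                  // step 2: new DX takes the lock
            oldXceiver = new Thread(() -> {
                synchronized (datasetLock) { }        // step 3: old DX needs the same lock
            }, "old-DataXceiver");
            oldXceiver.setDaemon(true);
            oldXceiver.start();
            while (oldXceiver.getState() != Thread.State.BLOCKED) {
                Thread.yield();                       // wait until old DX is stuck on the monitor
            }
            oldXceiver.interrupt();                   // step 4: interrupt the old DX ...
            joinQuietly(oldXceiver, joinTimeoutMs);   // ... then join on it (bounded here)
            // An interrupt does not wake a thread blocked on monitor entry.
            blockedAfterInterrupt = oldXceiver.getState() == Thread.State.BLOCKED;
            aliveAfterJoin = oldXceiver.isAlive();
        }                                             // lock released: old DX can proceed
        joinQuietly(oldXceiver, 5_000);
        return new boolean[] { blockedAfterInterrupt, aliveAfterJoin, !oldXceiver.isAlive() };
    }

    private static void joinQuietly(Thread t, long ms) {
        try {
            t.join(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run(200)));
    }
}
```

The point of the demo is that the interrupt in step 4 cannot help: a thread blocked entering a {{synchronized}} region stays blocked until the monitor is released, so the old thread only exits once the joiner gives up the lock.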

