hadoop-hdfs-issues mailing list archives

From "Chris Nauroth (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-9752) Permanent write failures may happen to slow writers during datanode rolling upgrades
Date Thu, 04 Feb 2016 23:46:39 GMT

    [ https://issues.apache.org/jira/browse/HDFS-9752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133342#comment-15133342 ]

Chris Nauroth commented on HDFS-9752:

This is a really nice find!  The change looks good to me.

[~kihwal], regarding the test, I also am not seeing a more deterministic way to do it, barring
massive refactoring that we probably don't want to get into.  Do you think it could be made
faster by configuring {{dfs.client.datanode-restart.timeout}} to something less than 4 seconds
and then downtuning the test's sleeps accordingly?  Would that make it too unpredictable?
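For reference, the restart timeout mentioned above would be tuned in the test's configuration along these lines (property name as in hdfs-default.xml, where the default is 30 seconds; the 2-second value here is purely illustrative, not a recommendation from this thread):

```xml
<property>
  <name>dfs.client.datanode-restart.timeout</name>
  <!-- Illustrative value only; default is 30s. A smaller timeout would let
       the test's sleeps be shortened correspondingly. -->
  <value>2</value>
</property>
```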

> Permanent write failures may happen to slow writers during datanode rolling upgrades
> ------------------------------------------------------------------------------------
>                 Key: HDFS-9752
>                 URL: https://issues.apache.org/jira/browse/HDFS-9752
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Kihwal Lee
>            Assignee: Walter Su
>            Priority: Critical
>         Attachments: HDFS-9752.01.patch
> When datanodes are being upgraded, an out-of-band ack is sent upstream and the client
> does a pipeline recovery. The client may hit this multiple times as more nodes get
> upgraded. This normally does not cause any issue, but if the client holds the stream
> open without writing any data during this time, a permanent write failure can occur.
> This happens because there is a limit of 5 recovery attempts for the same packet,
> tracked by the "last acked sequence number". Since the empty heartbeat packets of an
> idle output stream do not increment the sequence number, the write fails after the
> client sees 5 pipeline breakages caused by datanode upgrades.
> This check/limit was added to avoid spinning until running out of nodes in the cluster
> due to corruption or other irrecoverable conditions. Datanode upgrade restarts should
> be excluded from the count.
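The failure mode quoted above can be sketched as follows. This is a hypothetical simplification, not the actual DFSOutputStream code: the class name, method, and field names are invented for illustration. It captures the two properties the description relies on: the error count is reset only when the acked sequence number advances, and the proposed fix is to skip counting when the breakage comes from an upgrade restart.

```java
// Hypothetical sketch of the recovery-attempt limit described in HDFS-9752.
// Not the real Hadoop client code; names are illustrative.
public class PipelineRecoveryCounter {
    static final int MAX_RECOVERY_ERRORS = 5; // the limit of 5 trials per packet

    private long lastAckedSeqno = Long.MIN_VALUE;
    private int errorCount = 0;

    /**
     * Called on each pipeline breakage. Returns false once the limit is
     * exhausted, i.e. the write becomes a permanent failure.
     */
    public boolean onPipelineFailure(long currentAckedSeqno,
                                     boolean restartingForUpgrade) {
        if (currentAckedSeqno != lastAckedSeqno) {
            // Progress was acked since the last failure: reset the counter.
            // An idle stream's heartbeat packets never advance the seqno,
            // so this reset never fires for a slow writer.
            lastAckedSeqno = currentAckedSeqno;
            errorCount = 0;
        }
        if (restartingForUpgrade) {
            // The change proposed in this issue: an upgrade restart is
            // expected and recoverable, so it is excluded from the count.
            return true;
        }
        errorCount++;
        return errorCount <= MAX_RECOVERY_ERRORS;
    }

    public static void main(String[] args) {
        PipelineRecoveryCounter c = new PipelineRecoveryCounter();
        boolean ok = true;
        // Idle stream: the acked seqno stays at 0 while six breakages occur.
        for (int i = 0; i < 6; i++) {
            ok = c.onPipelineFailure(0L, false);
        }
        System.out.println(ok ? "still writing" : "permanent failure");
        // prints "permanent failure"
    }
}
```

With the upgrade flag honored, the same six breakages would leave the stream writable, which is the behavior the patch restores for slow writers during rolling upgrades.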

This message was sent by Atlassian JIRA
