hadoop-hdfs-issues mailing list archives

From "Kihwal Lee (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HDFS-9752) Permanent write failures may happen to slow writers during datanode rolling upgrades
Date Wed, 03 Feb 2016 22:11:39 GMT
Kihwal Lee created HDFS-9752:
--------------------------------

             Summary: Permanent write failures may happen to slow writers during datanode rolling upgrades
                 Key: HDFS-9752
                 URL: https://issues.apache.org/jira/browse/HDFS-9752
             Project: Hadoop HDFS
          Issue Type: Bug
            Reporter: Kihwal Lee
            Priority: Critical


When datanodes are being upgraded, an out-of-band ack is sent upstream and the client performs
a pipeline recovery. The client may hit this multiple times as more nodes are upgraded. This
normally does not cause any problem, but if the client holds the stream open without writing
any data during this time, a permanent write failure can occur.

This is because there is a limit of 5 recovery attempts for the same packet, which is tracked
by the "last acked sequence number". Since the empty heartbeat packets for an idle output stream
do not increment the sequence number, the write will fail after seeing 5 pipeline breakages
caused by datanode upgrades.

This check/limit was added to avoid spinning until the cluster runs out of nodes because of
corruption or any other irrecoverable condition. Datanode upgrade restarts should
be excluded from the count.
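
To make the failure mode concrete, here is a minimal sketch of the counting logic described
above. It is not the actual HDFS client code: the class, method, and field names are
hypothetical, and the exact boundary (whether the write fails on the 5th or the 6th breakage)
is an assumption. It only illustrates why an idle stream exhausts the limit and how excluding
upgrade restarts from the count would avoid that.

```java
// Hypothetical sketch of the per-packet recovery limit; not the real HDFS client code.
class PipelineRecoverySketch {
    // The limit of 5 comes from the report; the exact boundary check is an assumption.
    static final int MAX_RECOVERIES_PER_PACKET = 5;

    private final boolean excludeUpgradeRestarts;
    private long lastAckedSeqno = -1;
    private long lastAckedSeqnoBeforeFailure = -1;
    private int recoveryCount = 0;

    PipelineRecoverySketch(boolean excludeUpgradeRestarts) {
        this.excludeUpgradeRestarts = excludeUpgradeRestarts;
    }

    // Called when a data packet is acked. Heartbeat packets on an idle
    // stream never reach this method, so the sequence number does not move.
    void onDataAck(long seqno) {
        lastAckedSeqno = seqno;
    }

    // Returns true if pipeline recovery may proceed, false if the write
    // fails permanently.
    boolean onPipelineFailure(boolean causedByUpgradeRestart) {
        if (excludeUpgradeRestarts && causedByUpgradeRestart) {
            // Proposed behavior: upgrade restarts do not count toward the limit.
            return true;
        }
        if (lastAckedSeqno != lastAckedSeqnoBeforeFailure) {
            // Progress was made since the last failure: start a new count.
            lastAckedSeqnoBeforeFailure = lastAckedSeqno;
            recoveryCount = 1;
            return true;
        }
        // No progress since the last failure: same packet, count again.
        recoveryCount++;
        return recoveryCount <= MAX_RECOVERIES_PER_PACKET;
    }
}
```

With excludeUpgradeRestarts set to false, an idle writer that sees repeated upgrade restarts
runs out of attempts even though nothing is wrong with its data. With it set to true, only
breakages from other causes consume the budget.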



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
