hadoop-hdfs-issues mailing list archives

From "Kihwal Lee (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-9106) Transfer failure during pipeline recovery causes permanent write failures
Date Fri, 18 Sep 2015 16:28:05 GMT

     [ https://issues.apache.org/jira/browse/HDFS-9106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kihwal Lee updated HDFS-9106:
-----------------------------
    Description: 
When a new node is added to a write pipeline during flush/sync and the partial block transfer
to that node fails, the write fails permanently instead of retrying or continuing with the
nodes remaining in the pipeline.
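
A minimal sketch of the write pattern that can run into this, assuming the standard DFSClient
configuration keys (dfs.client.block.write.replace-datanode-on-failure.*). The class below is
only illustrative and does not reproduce the failure itself:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// A long-lived writer that hflush()es periodically. With datanode replacement
// on failure enabled (the default), losing a pipeline node mid-write makes the
// client add a new datanode and transfer the partial block to it.
public class PeriodicFlushWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Defaults shown explicitly; ALWAYS and NEVER are the other policy values.
    conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.enable", true);
    conf.set("dfs.client.block.write.replace-datanode-on-failure.policy", "DEFAULT");

    try (FileSystem fs = FileSystem.get(conf);
         FSDataOutputStream out = fs.create(new Path("/tmp/pipeline-recovery-demo"))) {
      byte[] record = new byte[4096];
      for (int i = 0; i < 100_000; i++) {
        out.write(record);
        if (i % 1000 == 0) {
          out.hflush();   // pipeline recovery during flush/sync is where the failure surfaces
        }
      }
    }
  }
}
{code}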

The transfer often fails in busy clusters due to timeout. There is no per-packet ACK between
the client and datanode, or between the source and target datanodes. If the total transfer time
exceeds the configured timeout + 10 seconds (2 * 5 seconds of slack), the transfer is considered
failed. Naturally, the failure rate is higher with bigger block sizes.
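
To make the arithmetic concrete, here is an illustrative calculation only (the names are not the
actual DataNode code); it assumes a 60-second socket timeout, e.g. the dfs.client.socket-timeout
default, and a busy-cluster transfer rate of about 5 MB/s:

{code:java}
// The whole partial-block transfer must finish within one timeout window plus
// a fixed slack, so a nearly full large block easily misses the deadline.
public class TransferDeadline {
  public static void main(String[] args) {
    long socketTimeoutMs = 60_000;                   // assumed configured timeout
    long slackMs = 2 * 5_000;                        // the "2 * 5 seconds slack" above
    long deadlineMs = socketTimeoutMs + slackMs;     // 70,000 ms

    long blockBytes = 512L * 1024 * 1024;            // nearly full 512 MB block
    long busyClusterBytesPerSec = 5L * 1024 * 1024;  // assumed ~5 MB/s under load
    long transferMs = blockBytes * 1000 / busyClusterBytesPerSec;  // 102,400 ms

    System.out.println("transfer takes " + transferMs + " ms, deadline is " + deadlineMs
        + " ms, fails: " + (transferMs > deadlineMs));
  }
}
{code}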

I propose the following changes (a rough sketch follows the list):
- The transfer timeout needs to be separate from the per-packet timeout.
- The transfer should be retried if it fails.
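
The sketch below shows the intended behavior only, not an actual patch; transferWithRetry(),
BlockTransfer and the surrounding names are hypothetical placeholders. The whole-transfer
timeout scales with the amount of data to move instead of reusing the per-packet timeout, and
the transfer is retried a bounded number of times before the recovery is declared a permanent
failure.

{code:java}
import java.io.IOException;

public class PipelineTransferRetry {

  // Hypothetical abstraction over the partial block transfer to the new node.
  interface BlockTransfer {
    void run(long timeoutMs) throws IOException;
  }

  static void transferWithRetry(BlockTransfer transfer, long bytesToTransfer,
      long perPacketTimeoutMs, int maxRetries) throws IOException {
    // Scale the deadline with the data size instead of reusing the packet timeout.
    long assumedMinBytesPerSec = 1L * 1024 * 1024;   // illustrative floor throughput
    long transferTimeoutMs =
        Math.max(perPacketTimeoutMs, bytesToTransfer * 1000 / assumedMinBytesPerSec);

    IOException lastFailure = null;
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        transfer.run(transferTimeoutMs);
        return;                        // success: keep writing with the repaired pipeline
      } catch (IOException e) {
        lastFailure = e;               // retry instead of failing the write outright
      }
    }
    throw lastFailure;                 // give up only after the retry budget is spent
  }
}
{code}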

  was:
When a new node is added to a write pipeline during flush/sync and the partial block transfer
to that node fails, the write fails permanently instead of retrying or continuing with the
nodes remaining in the pipeline.

The transfer often fails in busy clusters due to timeout. There is no per-packet ACK between
the client and datanode, or between the source and target datanodes. If the total transfer time
exceeds the configured timeout + 10 seconds (2 * 5 seconds of slack), the transfer is considered
failed.

I propose the following changes:
- The transfer timeout needs to be separate from the per-packet timeout.
- The transfer should be retried if it fails.


> Transfer failure during pipeline recovery causes permanent write failures
> -------------------------------------------------------------------------
>
>                 Key: HDFS-9106
>                 URL: https://issues.apache.org/jira/browse/HDFS-9106
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Kihwal Lee
>            Priority: Critical
>
> When a new node is added to a write pipeline during flush/sync and the partial block
> transfer to that node fails, the write fails permanently instead of retrying or continuing
> with the nodes remaining in the pipeline.
> The transfer often fails in busy clusters due to timeout. There is no per-packet ACK between
> the client and datanode, or between the source and target datanodes. If the total transfer
> time exceeds the configured timeout + 10 seconds (2 * 5 seconds of slack), the transfer is
> considered failed. Naturally, the failure rate is higher with bigger block sizes.
> I propose the following changes:
> - The transfer timeout needs to be separate from the per-packet timeout.
> - The transfer should be retried if it fails.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
