hadoop-hdfs-issues mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-9178) Slow datanode I/O can cause a wrong node to be marked bad
Date Wed, 07 Oct 2015 18:02:27 GMT

    [ https://issues.apache.org/jira/browse/HDFS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14947280#comment-14947280
] 

Hudson commented on HDFS-9178:
------------------------------

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2437 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2437/])
HDFS-9178. Slow datanode I/O can cause a wrong node to be marked bad. (kihwal: rev 99e5204ff5326430558b6f6fd9da7c44654c15d7)
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestClientProtocolForPipelineRecovery.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNodeFaultInjector.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt


> Slow datanode I/O can cause a wrong node to be marked bad
> ---------------------------------------------------------
>
>                 Key: HDFS-9178
>                 URL: https://issues.apache.org/jira/browse/HDFS-9178
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>            Priority: Critical
>             Fix For: 3.0.0, 2.7.2
>
>         Attachments: HDFS-9178.branch-2.6.patch, HDFS-9178.patch
>
>
> When a non-leaf datanode in a pipeline is slow or stuck on disk I/O, the downstream node can time out reading packets, since even the heartbeat packets will not be relayed down.
> The packet read timeout is set in {{DataXceiver#run()}}:
> {code}
>   peer.setReadTimeout(dnConf.socketTimeout);
> {code}
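The effect of that read timeout can be illustrated with a minimal, self-contained sketch (hypothetical class and values, not HDFS code): a stand-in "upstream" accepts a connection but never writes anything, like a datanode stuck in disk I/O, so the downstream's blocking read hits its SO_TIMEOUT even though no heartbeat ever arrived.

```java
import java.io.InputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutSketch {
    public static void main(String[] args) throws Exception {
        try (ServerSocket upstream = new ServerSocket(0);
             Socket downstream = new Socket("localhost", upstream.getLocalPort())) {
            upstream.accept(); // connection established, but no data will ever flow
            // Analogous to peer.setReadTimeout(dnConf.socketTimeout);
            // 200 ms here for the sketch; HDFS defaults to 60 s.
            downstream.setSoTimeout(200);
            boolean timedOut = false;
            try {
                InputStream in = downstream.getInputStream();
                in.read(); // blocks: no packet (not even a heartbeat) arrives
            } catch (SocketTimeoutException e) {
                timedOut = true; // downstream gives up and closes the connection
            }
            System.out.println("timedOut=" + timedOut);
        }
    }
}
```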
> When the downstream node times out and closes the connection to the upstream node, the upstream node's {{PacketResponder}} gets an {{EOFException}} and sends an ack upstream with the downstream node's status set to {{ERROR}}. This causes the client to exclude the downstream node, even though the upstream node was the one that got stuck.
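The {{EOFException}} side of this can be reproduced in isolation with a hedged sketch (hypothetical class names, plain sockets instead of HDFS streams): once the peer closes its end, a blocking {{DataInputStream.readByte()}} on the ack connection throws {{EOFException}}, which the upstream then interprets as a downstream failure.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.net.ServerSocket;
import java.net.Socket;

public class AckEofSketch {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(0);
             Socket upstream = new Socket("localhost", server.getLocalPort());
             Socket downstream = server.accept()) {
            // The downstream closes its side, as it does after its read timeout fires.
            downstream.close();
            boolean gotEof = false;
            try (DataInputStream ackIn =
                     new DataInputStream(upstream.getInputStream())) {
                ackIn.readByte(); // PacketResponder-style blocking read of the ack
            } catch (EOFException e) {
                gotEof = true; // misread as a downstream failure -> status ERROR
            }
            System.out.println("gotEof=" + gotEof);
        }
    }
}
```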
> The connection to the downstream node has a longer timeout, so the downstream node will always time out first. The downstream timeout is set in {{writeBlock()}}:
> {code}
>           int timeoutValue = dnConf.socketTimeout +
>               (HdfsConstants.READ_TIMEOUT_EXTENSION * targets.length);
>           int writeTimeout = dnConf.socketWriteTimeout +
>               (HdfsConstants.WRITE_TIMEOUT_EXTENSION * targets.length);
>           NetUtils.connect(mirrorSock, mirrorTarget, timeoutValue);
>           OutputStream unbufMirrorOut = NetUtils.getOutputStream(mirrorSock,
>               writeTimeout);
> {code}
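The arithmetic behind "the downstream will always time out first" can be sketched with the stock constants (assumed here from {{HdfsConstants}} defaults, 60 s socket timeout plus a 5 s extension per remaining target, not read from a live config): the downstream node reads from upstream with the plain socket timeout, while the upstream's read timeout on the mirror connection is extended per downstream target, so it is strictly larger whenever any targets remain.

```java
public class TimeoutArithmetic {
    // Assumed stock HDFS defaults (cf. HdfsConstants); values in milliseconds.
    static final int SOCKET_TIMEOUT = 60_000;        // dfs.client.socket-timeout
    static final int READ_TIMEOUT_EXTENSION = 5_000; // per downstream target

    public static void main(String[] args) {
        int targets = 2; // downstream nodes remaining in the pipeline
        // What the downstream applies to its upstream connection:
        int downstreamReadTimeout = SOCKET_TIMEOUT;
        // What the upstream applies to the mirror connection, per writeBlock():
        int upstreamMirrorTimeout =
            SOCKET_TIMEOUT + READ_TIMEOUT_EXTENSION * targets; // 70_000
        // Strictly larger for any targets >= 1, so the downstream fires first.
        System.out.println(downstreamReadTimeout < upstreamMirrorTimeout);
    }
}
```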



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
