hadoop-hdfs-issues mailing list archives

From "Vinayakumar B (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-5522) Datanode disk error check may be incorrectly skipped
Date Wed, 17 Feb 2016 10:03:18 GMT

    [ https://issues.apache.org/jira/browse/HDFS-5522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15150220#comment-15150220 ]

Vinayakumar B commented on HDFS-5522:

bq. So, if one node is down (eg due to a rolling restart or a crash) all of the other nodes
are very soon running checkDiskError for no particularly good reason. Coupled with HDFS-7489,
this failure can also cascade
Yes, the same thing happened on one of our customers' clusters.
Because of network issues on a few nodes, all the other DataNodes connected to them in the pipeline started running disk checks.
Without HDFS-8845 (2.7.2), disk I/O on every DataNode hit 100%.
By the time the first round of disk checks had finished, another exception had already requested
a new one. This went on for more than 40 hours and slowed down every other application.
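
To illustrate why back-to-back requests are so costly, here is a minimal sketch of a throttle that allows at most one disk check per interval, so that a burst of pipeline exceptions cannot keep the disks permanently busy. It is not taken from any Hadoop patch: the class name {{DiskCheckThrottle}}, the method {{tryAcquire}}, and the interval parameter are all hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical helper (not Hadoop code): lets a caller start an expensive
 * disk check only if at least minIntervalMs has passed since the last one.
 */
final class DiskCheckThrottle {
    private final long minIntervalMs;
    // Time at which the last check was allowed to start; MIN_VALUE means "never".
    private final AtomicLong lastCheckMs = new AtomicLong(Long.MIN_VALUE);

    DiskCheckThrottle(long minIntervalMs) {
        if (minIntervalMs < 0) {
            throw new IllegalArgumentException("minIntervalMs must be >= 0");
        }
        this.minIntervalMs = minIntervalMs;
    }

    /**
     * Returns true if the caller may run a disk check now. The caller passes
     * the current time so that the logic is easy to test.
     */
    boolean tryAcquire(long nowMs) {
        while (true) {
            long last = lastCheckMs.get();
            if (last != Long.MIN_VALUE && nowMs - last < minIntervalMs) {
                return false; // a check started recently; skip this request
            }
            // Only one of several racing threads wins the right to check.
            if (lastCheckMs.compareAndSet(last, nowMs)) {
                return true;
            }
        }
    }
}
```

A caller would wrap each disk check request in {{if (throttle.tryAcquire(System.currentTimeMillis()))}}, so requests arriving within the interval are dropped rather than queued.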

> Datanode disk error check may be incorrectly skipped
> ----------------------------------------------------
>                 Key: HDFS-5522
>                 URL: https://issues.apache.org/jira/browse/HDFS-5522
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 0.23.9, 2.2.0
>            Reporter: Kihwal Lee
>            Assignee: Rushabh S Shah
>             Fix For: 2.5.0
>         Attachments: HDFS-5522-v2.patch, HDFS-5522-v3.patch, HDFS-5522.patch
> After HDFS-4581 and HDFS-4699, {{checkDiskError()}} is not called when network errors
> occur while processing DataNode requests.  This appears to create problems when a disk is
> having problems, but not failing I/O soon.
> If I/O hangs for a long time, a network read/write may time out first and the peer may close
> the connection. Although the error was caused by a faulty local disk, the disk check is not
> carried out in this case.
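
The gap described above can be illustrated with a minimal sketch of an exception-based filter. It is an assumption about the general shape of such a filter, not the actual DataNode code: the class {{DiskCheckPolicy}} and both method names are hypothetical. It shows that a timeout whose real cause is a hung local disk is indistinguishable, by exception type alone, from a genuine network fault.

```java
import java.net.SocketException;
import java.net.SocketTimeoutException;
import java.nio.channels.ClosedByInterruptException;

/** Hypothetical policy (not Hadoop code): decide from the exception type alone. */
final class DiskCheckPolicy {
    /** True for exception types that are usually blamed on the network or the peer. */
    static boolean looksNetworkRelated(Throwable e) {
        return e instanceof SocketException
            || e instanceof SocketTimeoutException
            || e instanceof ClosedByInterruptException;
    }

    /**
     * Skips the disk check for anything that looks network-related. A slow
     * local disk that makes the peer time out surfaces here as a socket
     * exception, so the check is skipped even though the disk is at fault.
     */
    static boolean shouldCheckDisk(Throwable e) {
        return !looksNetworkRelated(e);
    }
}
```

With this policy, {{shouldCheckDisk(new SocketTimeoutException("peer timed out"))}} returns false, while a plain {{IOException}} from a local read returns true.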

This message was sent by Atlassian JIRA