hadoop-hdfs-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-10777) DataNode should report&remove volume failures if DU cannot access files
Date Wed, 31 Aug 2016 09:40:20 GMT

https://issues.apache.org/jira/browse/HDFS-10777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15451746#comment-15451746

Steve Loughran commented on HDFS-10777:

# Checking for specific error messages coming out of the OS is pretty brittle w.r.t. the OS and
OS version. I don't know if there is a better way.
# Disks can come back. That's especially true if the disk flipped into some offline state after its
controller was hit too hard by IO requests; that's not unusual on Linux under heavy
load (at least in the past)...ops can remount the disk and all will recover again.

Will this code handle a disk coming back without restarting the DN?
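The first point above can be illustrated with a small sketch: instead of pattern-matching OS-specific error strings, probe the volume with a real filesystem operation and treat any {{IOException}} as a failure signal. The {{VolumeProbe}} class and method names here are hypothetical, not part of the actual DataNode code.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: detect an inaccessible volume by attempting an
// actual directory listing, rather than parsing OS error messages, which
// vary across operating systems and versions.
public class VolumeProbe {
    /** Returns true if the directory can actually be opened for listing. */
    public static boolean isAccessible(Path dir) {
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            return true;
        } catch (IOException e) {
            // Any I/O failure (missing dir, EIO, permission loss) marks
            // the volume suspect, regardless of the message text.
            return false;
        }
    }
}
```

This approach also answers the "disk coming back" question naturally: re-running the same probe after an ops remount would succeed again, so a periodic recheck could restore the volume without a DN restart.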

> DataNode should report&remove volume failures if DU cannot access files
> -----------------------------------------------------------------------
>                 Key: HDFS-10777
>                 URL: https://issues.apache.org/jira/browse/HDFS-10777
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.8.0
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>         Attachments: HDFS-10777.01.patch
> HADOOP-12973 refactored DU and made it pluggable. The refactoring has a side effect:
if DU encounters an exception, the exception is caught, logged and ignored, which essentially fixes
HDFS-9908 (in which runaway exceptions prevent DataNodes from handshaking with NameNodes).
> However, this "fix" is not good, in the sense that if the disk is bad, the DataNode takes
no immediate action other than logging the exception. The existing {{FsDatasetSpi#checkDataDir}}
has been reduced to blindly checking only a small number of directories. If a disk goes bad, it
is often the case that only a few files are bad initially, so by checking only a small
number of directories it is easy to overlook the degraded disk.
> I propose: in addition to logging the exception, DataNode should proactively verify the
files are not accessible, remove the volume, and make the failure visible by showing it in
JMX, so that administrators can spot the failure via monitoring systems.
> A different fix, based on HDFS-9908, is needed before Hadoop 2.8.0
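The proposed handling in the description (verify accessibility, remove the volume, surface the failure for monitoring) might be sketched roughly as below. All class and method names here are illustrative; they are not the real {{FsDatasetSpi}} or DataNode APIs, and the real change would publish the failure via JMX rather than a plain counter.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch of the proposed flow: on a DU exception, verify the
// volume is actually inaccessible before removing it from service.
public class DuFailureHandler {
    private final List<Path> failedVolumes = new ArrayList<>();

    /** Called when DU throws while scanning a volume. */
    public void onDuException(Path volumeRoot, IOException cause) {
        if (!canList(volumeRoot)) {
            // Volume is genuinely inaccessible: take it out of service.
            failedVolumes.add(volumeRoot);
            // In the real DataNode this would also be exposed through JMX
            // (a volume-failure metric) so monitoring systems can alert.
        }
        // Otherwise treat it as transient: keep the volume, log `cause`.
    }

    private static boolean canList(Path dir) {
        try (DirectoryStream<Path> s = Files.newDirectoryStream(dir)) {
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public List<Path> getFailedVolumes() {
        return Collections.unmodifiableList(failedVolumes);
    }
}
```

The key design point is that a single DU exception is not trusted on its own: the volume is re-verified with a direct filesystem operation, which avoids both the brittleness of message matching and the blind sampling of {{checkDataDir}}.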

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org
