hadoop-hdfs-issues mailing list archives

From "Chris Nauroth (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-7604) Track and display failed DataNode storage locations in NameNode.
Date Wed, 11 Feb 2015 00:25:13 GMT

     [ https://issues.apache.org/jira/browse/HDFS-7604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Nauroth updated HDFS-7604:
    Attachment: HDFS-7604.005.patch

Jitendra, thank you for reviewing.  Here is patch v005, containing both of the changes that
you suggested.

bq. If volumeFailureSummary is not null, it might be more accurate to compare last failure

Yes, that's particularly relevant when considering the new live DataNode reconfiguration feature.
If volumes are reconfigured and the number of volume failures stays the same, but the failed
volumes themselves are different, then the old count-based logic wouldn't have caught it.
Comparing the last failure timestamps handles it well.
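
To illustrate the difference, here is a minimal sketch, not the code from the patch. The {{Summary}} class, its fields, and both comparison methods are hypothetical stand-ins chosen for this example; the real classes in the patch may be shaped differently.

```java
public class VolumeFailureChangeSketch {
    /** Hypothetical stand-in for the failure summary a DataNode reports. */
    static final class Summary {
        final String[] failedLocations;
        final long lastFailureDate; // assumed: timestamp of the most recent volume failure

        Summary(String[] failedLocations, long lastFailureDate) {
            this.failedLocations = failedLocations;
            this.lastFailureDate = lastFailureDate;
        }
    }

    /** Count-based check: misses a change when the number of failures is unchanged. */
    static boolean changedByCount(Summary oldSummary, Summary newSummary) {
        return oldSummary.failedLocations.length != newSummary.failedLocations.length;
    }

    /** Timestamp-based check: any newer failure moves the last failure date. */
    static boolean changedByLastFailure(Summary oldSummary, Summary newSummary) {
        return oldSummary.lastFailureDate != newSummary.lastFailureDate;
    }

    public static void main(String[] args) {
        // After reconfiguration: still one failed volume, but a different one.
        Summary before = new Summary(new String[] {"/data/1"}, 1000L);
        Summary after = new Summary(new String[] {"/data/2"}, 2000L);

        System.out.println(changedByCount(before, after));       // false
        System.out.println(changedByLastFailure(before, after)); // true
    }
}
```

In this sketch the count-based check reports no change even though the failed volume is a different one, while the timestamp comparison detects it.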

bq. In case of rolling upgrades, the older version of datanodes will not send volumeFailureSummary,
and the newer namenode might erroneously conclude 0 volume failures.

That's a great catch.  I restored explicit tracking of the {{volumeFailures}} counter in {{DatanodeDescriptor}}.
 The implementation of {{DatanodeDescriptor#getVolumeFailures}} is fine for both old and new
DataNode heartbeats, because for the new case, we guarantee that this counter is consistent
with the value returned from {{getVolumeFailureSummary}}.
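
As a rough sketch of why the explicit counter matters during a rolling upgrade (this is not the patch itself): the class below, its {{updateHeartbeat}} method, and the {{naiveCountFromSummary}} helper are hypothetical names for illustration, and it assumes a heartbeat carries a failure count plus an optional summary that is null when sent by an older DataNode.

```java
public class DescriptorCounterSketch {
    /** Hypothetical stand-in for the failure summary sent by newer DataNodes. */
    static final class Summary {
        final String[] failedLocations;

        Summary(String[] failedLocations) {
            this.failedLocations = failedLocations;
        }
    }

    private int volumeFailures;           // tracked explicitly, present in old and new heartbeats
    private Summary volumeFailureSummary; // null when the heartbeat comes from an older DataNode

    /** Assumed heartbeat handler: stores the count and whatever summary arrived. */
    void updateHeartbeat(int reportedFailures, Summary summary) {
        this.volumeFailures = reportedFailures;
        this.volumeFailureSummary = summary;
    }

    int getVolumeFailures() {
        return volumeFailures;
    }

    Summary getVolumeFailureSummary() {
        return volumeFailureSummary;
    }

    /** The approach to avoid: deriving the count from the summary alone. */
    static int naiveCountFromSummary(Summary summary) {
        return summary == null ? 0 : summary.failedLocations.length;
    }

    public static void main(String[] args) {
        DescriptorCounterSketch node = new DescriptorCounterSketch();

        // Older DataNode: reports 2 failed volumes but sends no summary.
        node.updateHeartbeat(2, null);

        System.out.println(node.getVolumeFailures());                              // 2
        System.out.println(naiveCountFromSummary(node.getVolumeFailureSummary())); // 0
    }
}
```

For a newer DataNode in this sketch, the caller is assumed to pass a count equal to the number of locations in the summary, so both views agree.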

The test failure in the last Jenkins run was unrelated.

> Track and display failed DataNode storage locations in NameNode.
> ----------------------------------------------------------------
>                 Key: HDFS-7604
>                 URL: https://issues.apache.org/jira/browse/HDFS-7604
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode, namenode
>            Reporter: Chris Nauroth
>            Assignee: Chris Nauroth
>         Attachments: HDFS-7604-screenshot-1.png, HDFS-7604-screenshot-2.png, HDFS-7604-screenshot-3.png,
> HDFS-7604-screenshot-4.png, HDFS-7604-screenshot-5.png, HDFS-7604-screenshot-6.png, HDFS-7604-screenshot-7.png,
> HDFS-7604.001.patch, HDFS-7604.002.patch, HDFS-7604.004.patch, HDFS-7604.005.patch, HDFS-7604.prototype.patch
>
> During heartbeats, the DataNode can report a list of its storage locations that have
> been taken out of service due to failure (such as due to a bad disk or a permissions problem).
> The NameNode can track these failed storage locations and then report them in JMX and the
> NameNode web UI.

