hadoop-hdfs-issues mailing list archives

From "Chris Nauroth (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HDFS-7604) Track and display failed DataNode storage locations in NameNode.
Date Mon, 02 Feb 2015 22:06:35 GMT

[ https://issues.apache.org/jira/browse/HDFS-7604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302025#comment-14302025 ]

Chris Nauroth edited comment on HDFS-7604 at 2/2/15 10:05 PM:
--------------------------------------------------------------

I've done another mock-up of the UI.  This version avoids adding clutter to the existing Datanodes
page and instead moves failure information to its own dedicated page.

Just like in the existing screenshot 3, there is a new field on the summary for Total Failed
Volumes.  I also intend to display lost capacity in parentheses next to it.  However, unlike
last time, the existing Datanodes page is unchanged.  Instead, the volume failure information
is on a new Datanode Volumes page, shown in new screenshot 4.  This is hyperlinked from both
the Total Failed Volumes field in the summary and a new tab in the top nav.

The new page has a table displaying only the DataNodes that have volume failures.  For each
one, it displays the address, seconds since last contact, time of last volume failure, number
of failed volumes, estimated capacity lost due to these volume failures, and a list of every
failed storage location's path.  I call the capacity lost an estimate because some edge
cases can prevent us from reporting accurate numbers here.  For example, if a volume hits
an I/O error before we get a chance to check its capacity, then we don't know how much
storage that volume held.
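
To make the table columns concrete, here is a rough sketch of the per-node record that
could back each row.  This is illustrative only; the field names are placeholders, not
from the current patch.

{code:java}
import java.util.List;

// Illustrative sketch only: one possible shape for the record backing a row
// in the new Datanode Volumes table.  Field names are placeholders.
public class VolumeFailureRow {
  String datanodeAddress;              // host:port of the DataNode
  long secondsSinceLastContact;        // seconds since the last heartbeat
  long lastVolumeFailureDate;          // time of the most recent volume failure
  int failedVolumeCount;               // number of volumes out of service
  long estimatedCapacityLost;          // bytes; a best-effort estimate
  List<String> failedStorageLocations; // path of every failed storage location
}
{code}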

The end-user workflow I imagine for this is that an admin first checks the summary information
and notices a non-zero count for failed volumes.  Then, the admin navigates to the Datanode
Volumes page to get a list of volume failures across the cluster.  This view lists only the
DataNodes with volume failures, so the admin won't need to scan through the master list looking
for individual nodes with a non-zero volume failure count.  This can act as a sort of work
queue for the admin recovering or replacing disks.

I have not updated the patch.  I need to rework the heartbeat information to provide this
data for the UI.  Meanwhile, Last Failure Time and Estimated Capacity Lost are displayed as
TODO in the screenshot.  Further feedback is welcome while I continue coding a new patch.
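
For anyone curious about the direction of the heartbeat rework, the rough idea is to
replace the bare failed-volume count with an aggregate summary.  The type and field names
below are provisional sketches, not the final patch.

{code:java}
// Provisional sketch, not the final patch: an aggregate summary of volume
// failures that the DataNode could attach to its heartbeat, alongside (or in
// place of) the bare failed-volume count it sends today.
public class VolumeFailureSummary {
  String[] failedStorageLocations; // paths taken out of service
  long lastVolumeFailureDate;      // ms since epoch of the most recent failure
  long estimatedCapacityLostTotal; // bytes; best-effort, may undercount when a
                                   // volume fails before its capacity is read
}
{code}

The NameNode would then aggregate these summaries per node and expose them for the new page.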



> Track and display failed DataNode storage locations in NameNode.
> ----------------------------------------------------------------
>
>                 Key: HDFS-7604
>                 URL: https://issues.apache.org/jira/browse/HDFS-7604
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode, namenode
>            Reporter: Chris Nauroth
>            Assignee: Chris Nauroth
>         Attachments: HDFS-7604-screenshot-1.png, HDFS-7604-screenshot-2.png, HDFS-7604-screenshot-3.png,
>                      HDFS-7604-screenshot-4.png, HDFS-7604.001.patch, HDFS-7604.prototype.patch
>
>
> During heartbeats, the DataNode can report a list of its storage locations that have
> been taken out of service due to failure (such as a bad disk or a permissions problem).
> The NameNode can track these failed storage locations and then report them in JMX and
> the NameNode web UI.


