hadoop-hdfs-issues mailing list archives

From "Jeff Bean (Created) (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HDFS-2849) Improved usability around node decommissioning and block replication on dfshealth.jsp
Date Fri, 27 Jan 2012 02:23:40 GMT
Improved usability around node decommissioning and block replication on dfshealth.jsp

                 Key: HDFS-2849
                 URL: https://issues.apache.org/jira/browse/HDFS-2849
             Project: Hadoop HDFS
          Issue Type: New Feature
          Components: documentation, name-node
    Affects Versions: 0.20.2
            Reporter: Jeff Bean

Consider the following sequence:

    - Decommission a single node.
    - The under-replicated count reports all blocks.
    - Stop the decommission.
    - The under-replicated count drops slowly and heads to 0.

This is expected HDFS behavior, but while it is happening, utilities such as dfshealth.jsp
and fsck report high numbers of under-replicated blocks, and the node is not on the dead or
decommissioned nodes list. It is therefore unclear to novice administrators and HDFS newcomers
whether this is a failure condition that needs administrative attention.

Administrators find themselves constantly having to explain the under-replication number when
they could be doing better things with their time. They also constantly receive alarms that
can be disregarded, which raises fears of a "cry wolf" problem in which a real issue gets lost
in the noise.

A direct quote from such an administrator:

"When a datanode fails, it's not considered a 'decommissioning', so it does not show up in
that list, it just simply kicks on the underrep and we have to hunt through the LIVE list
and attempt to find out which node caused the issue. Obviously, we (the community) are not
being told on the DEAD list when a node appears (why this information has to be withheld has
always been an issue with me, how hard is it to put a date field in the DEAD list?)"

Nevertheless, we should have more information about a dying node instead of seeing a jump
in the underrep count from 0 to millions with no obvious reason. Perhaps add another
column saying 'DYING NODE'; anything would help.
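
As a rough illustration of the kind of hint being requested, the sketch below attaches a
likely cause to the under-replicated count by looking at recent node state changes. Everything
in it is hypothetical: `NodeEvent`, `explain_underreplication`, the event names, and the
one-hour window are assumptions made for this example and are not part of HDFS or dfshealth.jsp.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeEvent:
    # Hypothetical record of a datanode state change; not an HDFS class.
    node: str
    event: str        # e.g. "decommission_started", "decommission_stopped", "died"
    timestamp: float  # seconds since the epoch


def explain_underreplication(under_replicated: int,
                             events: List[NodeEvent],
                             now: float,
                             window_seconds: float = 3600.0) -> str:
    """Build a note to display next to the under-replicated block count."""
    if under_replicated == 0:
        return "No under-replicated blocks."
    # Keep only state changes that happened within the look-back window
    # (the one-hour default is an arbitrary assumption).
    recent = [e for e in events if 0 <= now - e.timestamp <= window_seconds]
    if not recent:
        return (f"{under_replicated} under-replicated blocks; "
                "no recent node event found, may need attention.")
    causes = ", ".join(f"{e.node} ({e.event})" for e in recent)
    return (f"{under_replicated} under-replicated blocks; "
            f"possibly related to: {causes}.")
```

With a note like this, an administrator could tell a count that is draining after a stopped
decommission apart from one that has no known cause.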


