hadoop-hdfs-issues mailing list archives

From "Andrew Wang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7642) NameNode should periodically log DataNode decommissioning progress
Date Mon, 03 Oct 2016 21:01:20 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543411#comment-15543411 ]

Andrew Wang commented on HDFS-7642:
-----------------------------------

Thanks for working on this, Sean. One meta comment and then some code-related ones:

What normally happens is that decom gets stuck at the end because of open-for-write files.
So, as an operator, often what you want to know is:

* Is this datanode still making progress?
* If not, is it blocked on open-for-write files? What are these files? Which client is keeping
these files open?

I'm not sure that adding more logging really helps with this. We already have logging in logBlockReplicationInfo
that gives you similar status information, but the remaining gaps are in understanding the
rate of decommissioning (which might be better addressed with per-DN rate metrics) and in
some debug tool that dumps the open-for-write files for a DN and the corresponding clients
who own the file leases (HDFS-10480 is along those lines). What do you think?
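To make the "is this datanode still making progress?" question concrete, here is a minimal sketch of a stall check. It is not taken from the patch or from any existing NameNode code: the class name DecomStallSketch, the observe method, and the threshold are all hypothetical, and the remaining-block count is assumed to be supplied by whatever already scans the decommissioning nodes.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: flags a decommissioning node whose remaining-block count stops changing. */
class DecomStallSketch {

    /** Last observed state for one datanode (names are illustrative only). */
    private static final class Sample {
        int remainingBlocks;
        long lastChangeMillis;
    }

    private final Map<String, Sample> samples = new HashMap<>();
    private final long stallThresholdMillis;

    DecomStallSketch(long stallThresholdMillis) {
        this.stallThresholdMillis = stallThresholdMillis;
    }

    /**
     * Records one observation for a datanode.
     *
     * @return true if blocks remain and the count has not changed for at
     *         least the configured threshold, false otherwise
     */
    boolean observe(String datanodeId, int remainingBlocks, long nowMillis) {
        Sample s = samples.get(datanodeId);
        if (s == null) {
            // First time we see this node: nothing to compare against yet.
            s = new Sample();
            s.remainingBlocks = remainingBlocks;
            s.lastChangeMillis = nowMillis;
            samples.put(datanodeId, s);
            return false;
        }
        if (remainingBlocks != s.remainingBlocks) {
            // The count moved, so the node is making progress.
            s.remainingBlocks = remainingBlocks;
            s.lastChangeMillis = nowMillis;
            return false;
        }
        return remainingBlocks > 0
            && nowMillis - s.lastChangeMillis >= stallThresholdMillis;
    }
}
```

A check like this only answers the first question. Finding the open-for-write files and the clients holding their leases would still need a separate tool.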

Code related:

* Can we make the new class static?
* We can use primitives (int) rather than objects (Integer) for better efficiency
* Recommend we change this to debug logging. Decom can take hours and run on tens of nodes
at a time, so printing like this can be spammy
* It would also be useful to track when this node was set to "decommissioning" status, so
you can judge the rate of progress.
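
Taken together, the suggestions above could look roughly like the following sketch. It does not reproduce the class in HDFS-7642.001.patch: DecomTrackerSketch, NodeProgress, and every field and method name are hypothetical, and java.util.logging at FINE level stands in for whatever logger and debug level the NameNode code actually uses.

```java
import java.util.Locale;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical sketch; none of these names come from the actual patch. */
class DecomTrackerSketch {

    /** Static nested class: it carries no implicit reference to an outer instance. */
    static final class NodeProgress {
        private final long startMillis;   // when the node was set to "decommissioning"
        private final int initialBlocks;  // primitive int rather than Integer
        private int remainingBlocks;

        NodeProgress(long startMillis, int initialBlocks) {
            this.startMillis = startMillis;
            this.initialBlocks = initialBlocks;
            this.remainingBlocks = initialBlocks;
        }

        void update(int remainingBlocks) {
            this.remainingBlocks = remainingBlocks;
        }

        int blocksDone() {
            return initialBlocks - remainingBlocks;
        }

        /** Average blocks handled per minute since decommissioning started. */
        double blocksPerMinute(long nowMillis) {
            long elapsed = nowMillis - startMillis;
            if (elapsed <= 0) {
                return 0.0;
            }
            return blocksDone() * 60_000.0 / elapsed;
        }

        String describe(long nowMillis) {
            return String.format(Locale.ROOT, "%d/%d blocks done, %.1f blocks/min",
                blocksDone(), initialBlocks, blocksPerMinute(nowMillis));
        }

        /** Builds and emits the message only when debug-level logging is enabled. */
        void logIfDebug(Logger log, long nowMillis) {
            if (log.isLoggable(Level.FINE)) {
                log.fine(describe(nowMillis));
            }
        }
    }
}
```

Keeping the start time alongside the counters is what lets the rate be computed at all, so an operator can tell a slow node from a stuck one.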

> NameNode should periodically log DataNode decommissioning progress
> ------------------------------------------------------------------
>
>                 Key: HDFS-7642
>                 URL: https://issues.apache.org/jira/browse/HDFS-7642
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Zhe Zhang
>            Assignee: Sean Mackrory
>            Priority: Minor
>         Attachments: HDFS-7642.001.patch
>
>
> We've seen a case where decommissioning was stuck because some files had more replicas
than DNs. HDFS-5662 fixes this particular issue, but there are other cases where the decommissioning
process might get stuck or slow down. Some monitoring / logging will help in debugging those
issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org

