hadoop-hdfs-issues mailing list archives

From "Eli Collins (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
Date Tue, 16 Oct 2012 16:27:04 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477131#comment-13477131 ]

Eli Collins commented on HDFS-3990:

bq. They will change when a pre-existing node, say one with the same storage id, is updated
with the new info.

I'm not sure re-registering with a new IP and the same storage ID actually works today.

bq. The patch appears to change the way the include and exclude work by trusting who the datanode
claims to be. What if a datanode "lies" about who it is? Or if a dns hiccup occurs when the
datanode is going to register? It sends its name as an ip, but the exclude list only has hosts.
There are a number of scenarios where a datanode could bypass the include/exclude list, which
is why we should never trust the client.

Take another look at the patch: the NN is doing the lookup, not the DN, and only at registration
time. How about we reject the DN registration in case of a DNS hiccup (rather than using the
DN-supplied value, which the patch currently does in this case)? The DN will retry until it succeeds.
When working on HDFS-3171 I considered removing the ability for the DN to override the hostname,
and having just one lookup per DN (i.e. currently both the NN and DN resolve the DN hostname).
We could open a separate jira for that; it might be easier to layer this one atop it.
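
To make the proposal concrete, here is a minimal sketch of NN-side resolution at registration time that rejects on a lookup failure instead of falling back to what the DN reports. Everything in it is hypothetical and is not taken from the patch: the class `RegistrationSketch`, the `ReverseResolver` hook, the `Result` values, and the treatment of an empty include list as "allow all" are assumptions made for illustration.

```java
import java.net.UnknownHostException;
import java.util.Set;

// Sketch only: none of these names are real HDFS classes.
final class RegistrationSketch {

    /** Hypothetical hook for the NN-side lookup, so DNS can be stubbed out. */
    interface ReverseResolver {
        String hostNameFor(String ipAddr) throws UnknownHostException;
    }

    enum Result { ACCEPTED, REJECTED_RETRY, REJECTED_NOT_ALLOWED }

    private final ReverseResolver resolver;
    private final Set<String> includes; // assumption: empty means "allow all"
    private final Set<String> excludes;

    RegistrationSketch(ReverseResolver resolver, Set<String> includes, Set<String> excludes) {
        this.resolver = resolver;
        this.includes = includes;
        this.excludes = excludes;
    }

    /** remoteIp is the address the NN sees on the connection, not a DN-supplied name. */
    Result register(String remoteIp) {
        final String host;
        try {
            host = resolver.hostNameFor(remoteIp);
        } catch (UnknownHostException e) {
            // DNS hiccup: refuse rather than trust the name the DN claims.
            return Result.REJECTED_RETRY;
        }
        // Check both the resolved name and the IP, since the lists may hold either.
        boolean included = includes.isEmpty()
                || includes.contains(host) || includes.contains(remoteIp);
        boolean excluded = excludes.contains(host) || excludes.contains(remoteIp);
        return (included && !excluded) ? Result.ACCEPTED : Result.REJECTED_NOT_ALLOWED;
    }

    public static void main(String[] args) {
        RegistrationSketch nn = new RegistrationSketch(
                ip -> "dn1.example.com", Set.of(), Set.of("dn1.example.com"));
        System.out.println(nn.register("10.0.0.1"));
    }
}
```

A DN that gets `REJECTED_RETRY` would simply register again later, which matches the "DN will retry until it succeeds" behavior described above.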

I'm against having DatanodeID fields that duplicate the other fields since I think we can
solve the problem here and avoid doing so. My experience from HDFS-3144 indicates we will
introduce bugs, and it's hard to untangle correctly later.

> NN's health report has severe performance problems
> --------------------------------------------------
>                 Key: HDFS-3990
>                 URL: https://issues.apache.org/jira/browse/HDFS-3990
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.23.0, 2.0.0-alpha, 3.0.0
>            Reporter: Daryn Sharp
>            Assignee: Daryn Sharp
>            Priority: Critical
>         Attachments: HDFS-3990.patch, HDFS-3990.patch, hdfs-3990.txt
>
> The dfshealth page will place a read lock on the namespace while it does a dns lookup
> for every DN.  On a multi-thousand node cluster, this often results in 10s+ load time for
> the health page.  10 concurrent requests were found to cause 7m+ load times during which time
> write operations blocked.
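
The issue described above comes down to doing a slow lookup per DN while a shared lock is held. A minimal sketch of one way around it, assuming the hostname is resolved once at registration and only read back by the health page, might look like the following. `HealthReportSketch`, its methods, and the `Function`-based resolver are hypothetical illustrations, not the namesystem's real classes or locking code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Function;

// Sketch only: a stand-in for the namespace lock and the DN map.
final class HealthReportSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Map<String, String> hostByIp = new LinkedHashMap<>();
    private final Function<String, String> resolver; // hypothetical IP -> hostname lookup

    HealthReportSketch(Function<String, String> resolver) {
        this.resolver = resolver;
    }

    /** Resolve once, before taking the lock, then store the result. */
    void register(String ip) {
        String host = resolver.apply(ip); // the slow step happens outside the lock
        lock.writeLock().lock();
        try {
            hostByIp.put(ip, host);
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Builds the report from stored names only; no lookups under the read lock. */
    List<String> healthReport() {
        lock.readLock().lock();
        try {
            List<String> rows = new ArrayList<>();
            for (Map.Entry<String, String> e : hostByIp.entrySet()) {
                rows.add(e.getValue() + " (" + e.getKey() + ")");
            }
            return rows;
        } finally {
            lock.readLock().unlock();
        }
    }

    public static void main(String[] args) {
        HealthReportSketch nn = new HealthReportSketch(ip -> "host-" + ip);
        nn.register("10.0.0.1");
        nn.register("10.0.0.2");
        System.out.println(nn.healthReport());
    }
}
```

With this shape the time spent under the read lock no longer depends on DNS latency, so concurrent health page requests would not hold up writers for the duration of the lookups.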
