hadoop-hdfs-issues mailing list archives

From "Eli Collins (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-3990) NN's health report has severe performance problems
Date Mon, 15 Oct 2012 22:57:04 GMT

     [ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eli Collins updated HDFS-3990:

    Attachment: hdfs-3990.txt

Maintaining both an ipAddr/hostName and a nodeAddr that carry the same information is error
prone, because the two can become inconsistent. For example, what do you do when the ipAddr and
the nodeAddr disagree? The ipAddr field for a DataNode ID should never change, because it and
the xferPort together form the unique key for a DataNode. We also now have to worry about the
state where we're both resolved and unresolved. Given that the crux of the problem is that we
want to cache the DNS lookup for the ipAddr of a DN, it seems simplest to just do that.

What do you think of the attached patch? It sets the DatanodeID hostname field at registration
time (like the IP addr) using the same lookup we do today and replaces the two problematic
lookups with uses of this field.
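
To make the idea concrete, here is a minimal sketch of resolving the hostname once at registration and serving later reads from the stored field. It is not the attached patch: the class DatanodeRegistrySketch, its NodeInfo record, the injectable resolver and the method names are all hypothetical stand-ins, and only the ipAddr plus xferPort key comes from the discussion above.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical stand-in for the NN-side datanode bookkeeping; not actual HDFS code. */
final class DatanodeRegistrySketch {

    /** Immutable snapshot of what was known when the node registered. */
    static final class NodeInfo {
        final String ipAddr;
        final int xferPort;
        final String hostName; // resolved once, at registration time

        NodeInfo(String ipAddr, int xferPort, String hostName) {
            this.ipAddr = ipAddr;
            this.xferPort = xferPort;
            this.hostName = hostName;
        }
    }

    private final Map<String, NodeInfo> nodes = new ConcurrentHashMap<>();
    private final Function<String, String> resolver; // maps an IP to a hostname

    DatanodeRegistrySketch(Function<String, String> resolver) {
        this.resolver = resolver;
    }

    /** One possible resolver: a reverse DNS lookup that falls back to the IP itself. */
    static String dnsLookup(String ipAddr) {
        try {
            return InetAddress.getByName(ipAddr).getCanonicalHostName();
        } catch (UnknownHostException e) {
            return ipAddr;
        }
    }

    /** The only place the (potentially slow) lookup happens. */
    NodeInfo register(String ipAddr, int xferPort) {
        String hostName = resolver.apply(ipAddr);
        NodeInfo fresh = new NodeInfo(ipAddr, xferPort, hostName);
        NodeInfo previous = nodes.putIfAbsent(key(ipAddr, xferPort), fresh);
        return previous != null ? previous : fresh;
    }

    /** Read path (e.g. a health page): a map lookup only, no DNS. */
    String hostNameOf(String ipAddr, int xferPort) {
        NodeInfo info = nodes.get(key(ipAddr, xferPort));
        return info == null ? null : info.hostName;
    }

    // ipAddr plus xferPort is treated as the unique key, as described above.
    private static String key(String ipAddr, int xferPort) {
        return ipAddr + ":" + xferPort;
    }
}
```

With something like new DatanodeRegistrySketch(DatanodeRegistrySketch::dnsLookup), any code that renders a report while holding a lock would call hostNameOf and never touch DNS. The resolver is injected only so the lookup count can be checked in a test.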

This breaks {{dfs.datanode.hostname}}, but this config is only used by the tests and we can
fix those up. I'm happy to do that in another rev of this patch if you like the approach.
I think a better approach would be to just use the lookup on the DN side (i.e. have the NN use
the DN-reported value), but that's a riskier change.

> NN's health report has severe performance problems
> --------------------------------------------------
>                 Key: HDFS-3990
>                 URL: https://issues.apache.org/jira/browse/HDFS-3990
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.23.0, 2.0.0-alpha, 3.0.0
>            Reporter: Daryn Sharp
>            Assignee: Daryn Sharp
>            Priority: Critical
>         Attachments: HDFS-3990.patch, HDFS-3990.patch, hdfs-3990.txt
> The dfshealth page takes a read lock on the namespace while it does a DNS lookup
> for every DN.  On a multi-thousand node cluster, this often results in 10s+ load times for
> the health page.  10 concurrent requests were found to cause 7m+ load times, during which
> write operations blocked.

