hadoop-common-issues mailing list archives

From "Ravi Prakash (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-15129) Datanode caches namenode DNS lookup failure and cannot startup
Date Wed, 12 Dec 2018 11:10:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-15129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16718798#comment-16718798 ]

Ravi Prakash commented on HADOOP-15129:
---------------------------------------

Hi Karthik! Thanks for your contribution. Could you please rebase the patch to the latest
trunk? I usually apply patches using
{code:bash}
$ git apply <patch-file>
{code}
A few suggestions:
 # Could you please use short descriptions in JIRA? I was told a long time ago to keep them short. :)
 # When using JIRA numbers, could you please write HDFS-8068 (instead of just 8068) because
issues often cut across several different projects, and this way JIRA creates nice links for
viewers to click on?

Patches are usually committed to trunk *first*, and then a (possibly) different version of
the patch may be committed to earlier branches like branch-2. So technically you could have
used neat lambdas in the trunk patch. ;) It's a nit, though.

I'm trying to find the wiki page that tried to explain certain errors. I'm afraid I rarely
found those pages useful (probably because we never really expanded on them), so I'm fine
with a more helpful error in the logs.

 

> Datanode caches namenode DNS lookup failure and cannot startup
> --------------------------------------------------------------
>
>                 Key: HADOOP-15129
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15129
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ipc
>    Affects Versions: 2.8.2
>         Environment: Google Compute Engine.
> I'm using Java 8, Debian 8, Hadoop 2.8.2.
>            Reporter: Karthik Palaniappan
>            Assignee: Karthik Palaniappan
>            Priority: Minor
>         Attachments: HADOOP-15129.001.patch, HADOOP-15129.002.patch
>
>
> On startup, the Datanode creates an InetSocketAddress to register with each namenode.
Though there are retries on connection failure throughout the stack, the same InetSocketAddress
is reused.
> InetSocketAddress is an interesting class, because it resolves DNS names to IP addresses
on construction, and it is never refreshed. Hadoop re-creates an InetSocketAddress in some
cases just in case the remote IP has changed for a particular DNS name: https://issues.apache.org/jira/browse/HADOOP-7472.
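> As a minimal illustration of that construction-time resolution (my own sketch, not Hadoop
> code; the hostname is just the one from the logs below):
> {code:java}
> import java.net.InetSocketAddress;
>
> public class ResolveOnConstruct {
>     public static void main(String[] args) {
>         // The DNS lookup happens here, exactly once, at construction time.
>         InetSocketAddress addr = new InetSocketAddress("cluster-32f5-m", 8020);
>
>         // If that lookup failed, the object stays unresolved forever;
>         // reusing it will never pick up a later DNS fix.
>         System.out.println("unresolved? " + addr.isUnresolved());
>
>         // The HADOOP-7472 approach: build a fresh instance to force a new lookup.
>         InetSocketAddress refreshed =
>             new InetSocketAddress(addr.getHostString(), addr.getPort());
>         System.out.println("after re-create, unresolved? " + refreshed.isUnresolved());
>     }
> }
> {code}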
> Anyway, on startup, you can see the Datanode log: "Namenode...remains unresolved" --
referring to the fact that the DNS lookup failed.
> {code:java}
> 2017-11-02 16:01:55,115 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Refresh
request received for nameservices: null
> 2017-11-02 16:01:55,153 WARN org.apache.hadoop.hdfs.DFSUtilClient: Namenode for null
remains unresolved for ID null. Check your hdfs-site.xml file to ensure namenodes are configured
properly.
> 2017-11-02 16:01:55,156 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Starting
BPOfferServices for nameservices: <default>
> 2017-11-02 16:01:55,169 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool
<registering> (Datanode Uuid unassigned) service to cluster-32f5-m:8020 starting to
offer service
> {code}
> The Datanode then proceeds to use this unresolved address, as it may work if the DN is
configured to use a proxy. Since I'm not using a proxy, it forever prints out this message:
> {code:java}
> 2017-12-15 00:13:40,712 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem
connecting to server: cluster-32f5-m:8020
> 2017-12-15 00:13:45,712 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem
connecting to server: cluster-32f5-m:8020
> 2017-12-15 00:13:50,712 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem
connecting to server: cluster-32f5-m:8020
> 2017-12-15 00:13:55,713 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem
connecting to server: cluster-32f5-m:8020
> 2017-12-15 00:14:00,713 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem
connecting to server: cluster-32f5-m:8020
> {code}
> Unfortunately, the log doesn't contain the exception that triggered it, but the culprit
is actually in IPC Client: https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java#L444.
> This line was introduced in https://issues.apache.org/jira/browse/HADOOP-487 to give
a clear error message when somebody misspells an address.
> However, the fix in HADOOP-7472 doesn't apply here, because that code happens in Client#getConnection
after the Connection is constructed.
> My proposed fix (I will attach a patch) is to move this exception out of the constructor
and into a place that will trigger HADOOP-7472's logic to re-resolve addresses. If the DNS
failure was temporary, this will allow the connection to succeed. If not, the connection will
fail after the ipc client retries (by default, 10 seconds' worth of retries).
> I want to fix this in ipc client rather than just in Datanode startup, as this fixes
temporary DNS issues for all of Hadoop.
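> A rough sketch of the shape of that fix, using hypothetical, simplified stand-ins rather
> than the real org.apache.hadoop.ipc.Client internals:
> {code:java}
> import java.net.InetSocketAddress;
> import java.net.UnknownHostException;
>
> // Simplified stand-in for the ipc Client's per-server Connection.
> class Connection {
>     private InetSocketAddress server;
>
>     Connection(InetSocketAddress server) {
>         // Before the patch: an UnknownHostException was thrown right here when
>         // server.isUnresolved(), freezing the startup-time DNS failure forever.
>         this.server = server;
>     }
>
>     void setupConnection() throws UnknownHostException {
>         if (server.isUnresolved()) {
>             // Re-resolve on each attempt (the HADOOP-7472 idea), so a transient
>             // DNS failure can heal within the ipc client's retry window.
>             server = new InetSocketAddress(server.getHostString(), server.getPort());
>             if (server.isUnresolved()) {
>                 throw new UnknownHostException(server.getHostString());
>             }
>         }
>         // ... continue with the actual socket connect and retry loop ...
>     }
> }
> {code}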



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
