hadoop-common-issues mailing list archives

From "Karthik Palaniappan (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HADOOP-15129) Datanode caches namenode DNS lookup failure and cannot startup
Date Tue, 19 Dec 2017 00:35:00 GMT
Karthik Palaniappan created HADOOP-15129:
--------------------------------------------

             Summary: Datanode caches namenode DNS lookup failure and cannot startup
                 Key: HADOOP-15129
                 URL: https://issues.apache.org/jira/browse/HADOOP-15129
             Project: Hadoop Common
          Issue Type: Bug
          Components: ipc
    Affects Versions: 2.8.2
         Environment: Google Compute Engine, or any environment where a small percent of DNS
lookups fail.

I'm using Java 8, Debian 8, Hadoop 2.8.2.
            Reporter: Karthik Palaniappan
            Priority: Minor


On startup, the Datanode creates an InetSocketAddress to register with each namenode. Though
there are retries on connection failure throughout the stack, the same InetSocketAddress is
reused.

InetSocketAddress is an interesting class, because it resolves DNS names to IP addresses on
construction and never refreshes the result. Hadoop re-creates an InetSocketAddress in some cases
in case the remote IP has changed for a particular DNS name: https://issues.apache.org/jira/browse/HADOOP-7472.
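
To illustrate the behavior described above, here is a minimal sketch using only java.net.InetSocketAddress. It is not Hadoop code: the class name StaleAddressDemo and the helper reResolveIfNeeded are hypothetical, and an IP literal stands in for a host name so that the example does not depend on a real DNS lookup.

```java
import java.net.InetSocketAddress;

// Hypothetical demo class, not part of Hadoop.
class StaleAddressDemo {

    /**
     * Returns the same object if it is already resolved. Otherwise builds a
     * new InetSocketAddress, which triggers a fresh resolution attempt,
     * because an existing instance never re-resolves itself.
     */
    static InetSocketAddress reResolveIfNeeded(InetSocketAddress addr) {
        if (!addr.isUnresolved()) {
            return addr;
        }
        return new InetSocketAddress(addr.getHostString(), addr.getPort());
    }

    public static void main(String[] args) {
        // Simulates an address whose lookup failed at construction time.
        InetSocketAddress stale = InetSocketAddress.createUnresolved("127.0.0.1", 8020);
        System.out.println(stale.isUnresolved()); // true, and it stays that way

        // Only a newly constructed instance can pick up a successful lookup.
        InetSocketAddress fresh = reResolveIfNeeded(stale);
        System.out.println(fresh.isUnresolved()); // false
    }
}
```

The point of the sketch is that retrying a connection with the same unresolved instance cannot succeed on its own; something has to construct a new address.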

Anyway, on startup, you can see the Datanode log "Namenode...remains unresolved", referring
to the fact that the DNS lookup failed.

{code:java}
2017-11-02 16:01:55,115 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Refresh request received for nameservices: null
2017-11-02 16:01:55,153 WARN org.apache.hadoop.hdfs.DFSUtilClient: Namenode for null remains unresolved for ID null. Check your hdfs-site.xml file to ensure namenodes are configured properly.
2017-11-02 16:01:55,156 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Starting BPOfferServices for nameservices: <default>
2017-11-02 16:01:55,169 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool <registering> (Datanode Uuid unassigned) service to cluster-32f5-m:8020 starting to offer service
{code}

The Datanode then proceeds to use this unresolved address, as it may work if the DN is configured
to use a proxy. Since I'm not using a proxy, it forever prints out this message:


{code:java}
2017-12-15 00:13:40,712 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem connecting to server: cluster-32f5-m:8020
2017-12-15 00:13:45,712 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem connecting to server: cluster-32f5-m:8020
2017-12-15 00:13:50,712 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem connecting to server: cluster-32f5-m:8020
2017-12-15 00:13:55,713 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem connecting to server: cluster-32f5-m:8020
2017-12-15 00:14:00,713 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem connecting to server: cluster-32f5-m:8020
{code}

Unfortunately, the log doesn't contain the exception that triggered it, but the culprit is
actually in IPC Client: https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java#L444.

This line was introduced in https://issues.apache.org/jira/browse/HADOOP-487 to give a clear
error message when somebody misspells an address.

However, the fix in HADOOP-7472 doesn't apply here, because that code happens in Client#getConnection
after the Connection is constructed.

My proposed fix (will attach a patch) is to move this exception out of the constructor and
into a place that will trigger HADOOP-7472's logic to re-resolve addresses. If the DNS failure
was temporary, this will allow the connection to succeed. If not, the connection will fail
after the IPC client's retries (by default, 10 seconds' worth of retries).
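
As a rough sketch of the idea only (this is not the attached patch and not the actual code in Client.java), the fix amounts to re-creating the address on each retry and raising the "unknown host" error only once the retries are exhausted. The class RetryingResolver, the method resolveWithRetries, and its parameters are all hypothetical names chosen for illustration; the Supplier stands in for whatever code constructs the InetSocketAddress.

```java
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import java.util.function.Supplier;

// Hypothetical helper, not part of Hadoop.
class RetryingResolver {

    /**
     * Calls the factory up to maxAttempts times (maxAttempts >= 1), sleeping
     * between attempts, until it yields a resolved address. Each call is
     * expected to construct a new InetSocketAddress, so each attempt performs
     * a fresh DNS lookup instead of reusing a cached failure.
     */
    static InetSocketAddress resolveWithRetries(
            Supplier<InetSocketAddress> factory, int maxAttempts, long sleepMillis)
            throws UnknownHostException, InterruptedException {
        InetSocketAddress addr = factory.get();
        for (int attempt = 1; addr.isUnresolved() && attempt < maxAttempts; attempt++) {
            Thread.sleep(sleepMillis);
            addr = factory.get(); // re-create rather than reuse the stale object
        }
        if (addr.isUnresolved()) {
            // Fail only after every attempt has been used up.
            throw new UnknownHostException(addr.getHostString());
        }
        return addr;
    }
}
```

A caller might pass something like `() -> new InetSocketAddress(host, port)` as the factory. With this shape, a transient lookup failure costs a few retries, while a genuinely misspelled host name still ends in a clear error.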

I want to fix this in the IPC client rather than just in Datanode startup, as this fixes temporary
DNS issues for all of Hadoop.



