hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "UnknownHost" by SteveLoughran
Date Mon, 27 Jun 2011 12:01:35 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "UnknownHost" page has been changed by SteveLoughran:

how to troubleshoot unknown host exceptions

New page:
= Unknown Host =

You get an Unknown Host error, often wrapped in a Java {{{IOException}}}, when one machine
on the network cannot determine the IP address of a host that it is trying to connect to by
way of its hostname. This can happen during file upload (in which case the client machine
has the hostname problem), or inside the Hadoop cluster itself.

Some possible causes (not an exclusive list):
 * The site's DNS server does not have an entry for the node. Test: do an {{{nslookup <hostname>}}}
from the client machine.
 * The calling machine's host table {{{/etc/hosts}}} lacks an entry for the host, and DNS
isn't helping out.
 * There's some error in the configuration files and the hostname is actually wrong.
 * A worker node thinks it has a given name, which it reports to the NameNode and JobTracker,
but that isn't the name the network team expects, so it isn't resolvable.
 * The calling machine is on a different subnet from the target machine, and short names are
being used instead of fully qualified domain names (FQDNs).
 * The client's network card is playing up (network timeouts, etc), the network is overloaded,
or even the switch is dropping DNS packets.
 * The host's IP address has changed but a long-lived JVM is caching the old value. This is
a known problem with JVMs (search for "java negative DNS caching" for the details and solutions).
The quick solution: restart the JVMs.
 * The site's DNS server is overloaded. This can happen in large clusters. Either move to
host table entries or use caching DNS servers in every worker node.
 * Your ARP cache is corrupt, either accidentally or maliciously. If you don't know what that
means, you won't be in a position to verify this is the problem, or to fix it.
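Several of the causes above (missing DNS entry, missing host-table entry, short name vs. FQDN) can be checked from the suspect machine in one pass. A minimal sketch, not part of Hadoop, assuming a POSIX shell with {{{getent}}} available; it defaults to {{{localhost}}} purely so the sketch runs anywhere, so substitute the failing host's name:

```shell
#!/bin/sh
# Hypothetical helper: verify that a hostname resolves on this machine
# before digging deeper. Pass the suspect worker's name as $1.
host="${1:-localhost}"

# getent consults /etc/hosts and DNS in the order nsswitch.conf lists them,
# which is the same view of the world the resolver gives a JVM.
addr=$(getent hosts "$host" | awk '{print $1; exit}')

if [ -n "$addr" ]; then
  echo "OK: $host -> $addr"
else
  echo "FAIL: $host does not resolve; check DNS and /etc/hosts" >&2
  exit 1
fi
```

Run it once with the short name and once with the FQDN: if only the FQDN works, you have found the cross-subnet naming problem listed above.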

These are all network configuration/router issues. As it is your network, only you can find
and track down the problem. That said, any tooling to help Hadoop track down such problems
in a cluster would be welcome, as would extra diagnostics. If you have to extend Hadoop to
track down these issues, submit your patches!

Some tactics to help solve the problem:
 1. Look for configuration problems first (Hadoop XML files, hostnames, host tables), as these
are easiest to fix and quite common.
 1. Try to identify which client machine is playing up. If it is out-of-cluster, try the
FQDN instead, and consider that it may not have access to the worker node.
 1. If the client that does not work is one of the machines in the cluster, SSH to that machine
and make sure it can resolve the hostname.
 1. As well as {{{nslookup}}}, the {{{dig}}} command is invaluable for tracking down DNS problems,
though it does assume you understand DNS records. Now is a good time to learn.
 1. Restart the JVMs to see if that makes it go away.
 1. Restart the servers to see if that makes it go away.

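If restarting the JVMs does make the problem go away, the DNS-caching cause listed above is a likely suspect. A hedged workaround, assuming an OpenJDK/Sun-style JVM where the {{{sun.net.inetaddr.ttl}}} system properties are honored (they are JVM-specific, not part of any specification):

```shell
# Bound the JVM's DNS caches instead of restarting daemons after every
# address change. Values are in seconds; 0 disables negative caching.
# HADOOP_OPTS is picked up by the Hadoop launcher scripts.
export HADOOP_OPTS="$HADOOP_OPTS \
  -Dsun.net.inetaddr.ttl=60 \
  -Dsun.net.inetaddr.negative.ttl=0"
```

The equivalent {{{networkaddress.cache.ttl}}} security properties can be set in the JVM's {{{java.security}}} file instead, which keeps the policy out of launcher scripts.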
Remember, unless the root cause has been identified, the problem may return.
