hadoop-hdfs-issues mailing list archives

From "Ming Ma (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HDFS-6721) Handle the situation where SBN is in zombie state
Date Tue, 22 Jul 2014 05:58:39 GMT
Ming Ma created HDFS-6721:

             Summary: Handle the situation where SBN is in zombie state
                 Key: HDFS-6721
                 URL: https://issues.apache.org/jira/browse/HDFS-6721
             Project: Hadoop HDFS
          Issue Type: Improvement
            Reporter: Ming Ma


In an HA setup, when the first NN in the service list is the standby NN (SBN), the RPC client
always tries the first NN, gets a StandbyException, and then fails over to the second NN in the
service list, which is the active NN.
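
The try-in-order behavior described above can be sketched as follows. This is only an
illustration: the StandbyException class, the callWithFailover helper, and the names nn1 and nn2
are stand-ins invented for the example, not Hadoop's classes. The real client implements this
through its failover proxy provider and retry policies, which the sketch does not use.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class OrderedFailoverSketch {
    /** Stand-in for the error a standby NameNode returns; NOT Hadoop's own class. */
    static class StandbyException extends RuntimeException {
        StandbyException(String msg) { super(msg); }
    }

    /** Tries each NameNode in configured order and returns the first successful result. */
    static <T> T callWithFailover(List<String> nameNodes, Function<String, T> rpc) {
        StandbyException last = null;
        for (String nn : nameNodes) {
            try {
                return rpc.apply(nn);
            } catch (StandbyException e) {
                last = e; // the standby rejected the call; move on to the next NameNode
            }
        }
        throw new IllegalStateException("no active NameNode found", last);
    }

    public static void main(String[] args) {
        // Hypothetical service list in which the first entry is the standby.
        List<String> serviceList = Arrays.asList("nn1", "nn2");
        String result = callWithFailover(serviceList, nn -> {
            if (nn.equals("nn1")) {
                throw new StandbyException(nn + " is standby");
            }
            return "served by " + nn;
        });
        System.out.println(result); // prints "served by nn2"
    }
}
```

The cost of the first failed attempt is what matters in the rest of this report: it is cheap when
the standby answers or refuses quickly, and expensive when the attempt has to time out.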

This works well when the SBN is healthy. It also works well when the SBN isn't running, for
example during a rolling upgrade, in which case the client gets "java.net.ConnectException:
Connection refused" right away.

When the SBN is in a zombie state, for example when the machine is low on memory and the SBN is
still running but can't do much, the client gets a ConnectTimeoutException instead:

14/07/21 04:12:42 DEBUG ipc.Client: Connecting to hadoop-foo-nn1/a.b.c.d:8020
14/07/21 04:13:02 DEBUG ipc.Client: closing ipc connection to hadoop-foo-nn1/a.b.c.d:8020:
20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending

When this happens, each RPC client connection wastes 20 seconds before failing over, which
slows down MR jobs significantly.
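
To make the cost concrete, here is a rough model of the delay before failover in each of the
three standby states described above. The 20000 ms timeout comes from the log excerpt. The 5 ms
round-trip time and the figure of 1000 connections are assumptions chosen purely for
illustration, not measurements, and the class and method names are hypothetical.

```java
public class FailoverDelaySketch {
    enum StandbyState { HEALTHY, DOWN, ZOMBIE }

    /** Approximate time a client spends on the standby before failing over. */
    static long delayBeforeFailoverMillis(StandbyState state,
                                          long connectTimeoutMillis,
                                          long roundTripMillis) {
        switch (state) {
            case HEALTHY:
                return roundTripMillis;      // StandbyException comes back promptly
            case DOWN:
                return roundTripMillis;      // "Connection refused" comes back right away
            case ZOMBIE:
                return connectTimeoutMillis; // connect attempt hangs until it times out
            default:
                throw new IllegalArgumentException("unknown state: " + state);
        }
    }

    /** Aggregate time lost across many connections that each hit the timeout. */
    static long totalWastedMillis(long connections, long perConnectionDelayMillis) {
        return connections * perConnectionDelayMillis;
    }

    public static void main(String[] args) {
        long connectTimeoutMillis = 20_000L; // from the "20000 millis timeout" log line
        long roundTripMillis = 5L;           // assumed value, for illustration only
        for (StandbyState state : StandbyState.values()) {
            long delay = delayBeforeFailoverMillis(state, connectTimeoutMillis, roundTripMillis);
            System.out.println(state + ": " + delay + " ms");
        }
        long connections = 1_000L;           // hypothetical number of client connections
        long zombieDelay = delayBeforeFailoverMillis(
                StandbyState.ZOMBIE, connectTimeoutMillis, roundTripMillis);
        System.out.println("total: " + totalWastedMillis(connections, zombieDelay) / 1000 + " s");
    }
}
```

Under these assumptions, 1000 connections each waiting out the timeout add up to 20000 seconds of
aggregate waiting, compared with about 5 seconds in total when the standby is healthy or down.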

Perhaps this is the responsibility of an external monitoring service for HDFS, which could
detect a machine in a zombie state and restart it.

Can we have HDFS handle this automatically? The state in ZK and on the DNs points to the correct
active NN. For example, a task JVM could get a hint for the active NN from the DN on the local
machine.
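
One way the hint idea could look on the client side is sketched below. Everything here is
hypothetical: the report does not specify how a hint would be obtained or applied, and
orderByHint is not an existing HDFS API. The sketch only shows the reordering step, assuming the
client has somehow received the name of the NN that the local DN believes is active.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ActiveHintSketch {
    /**
     * Returns the configured NameNodes with the hinted one moved to the front.
     * A missing or unrecognized hint leaves the configured order unchanged, so a
     * stale or wrong hint can never remove a NameNode from consideration.
     */
    static List<String> orderByHint(List<String> configured, String activeHint) {
        if (activeHint == null || !configured.contains(activeHint)) {
            return new ArrayList<>(configured);
        }
        List<String> ordered = new ArrayList<>();
        ordered.add(activeHint);
        for (String nn : configured) {
            if (!nn.equals(activeHint)) {
                ordered.add(nn);
            }
        }
        return ordered;
    }

    public static void main(String[] args) {
        // Hypothetical service list where the first entry is the (zombie) standby.
        List<String> configured = Arrays.asList("nn1", "nn2");
        System.out.println(orderByHint(configured, "nn2"));  // prints "[nn2, nn1]"
        System.out.println(orderByHint(configured, null));   // prints "[nn1, nn2]"
    }
}
```

Because the hint only changes the order of attempts, the existing failover path still applies if
the hint turns out to be out of date.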

This message was sent by Atlassian JIRA
