hadoop-common-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Yongjun Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HADOOP-14322) Incorrect host info may be reported in failover message
Date Tue, 18 Apr 2017 19:49:41 GMT
Yongjun Zhang created HADOOP-14322:

             Summary: Incorrect host info may be reported in failover message
                 Key: HADOOP-14322
                 URL: https://issues.apache.org/jira/browse/HADOOP-14322
             Project: Hadoop Common
          Issue Type: Bug
          Components: common
            Reporter: Yongjun Zhang

This may apply to other components, but using HDFS as an example.

When multiple threads use the same DFSClient to make RPC calls, they may report incorrect
NN host name in the failover message:
INFO [pool-3-thread-13] retry.RetryInvocationHandler (RetryInvocationHandler.java:invoke(148))
- Exception while invoking delete of class ClientNamenodeProtocolTranslatorPB over *a.b.c.d*:8020.
Trying to fail over immediately.
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category
WRITE is not supported in state standby. Visit https://s.apache.org/sbnn-error

where *a.b.c.d* is the RPC proxy corresponds to the active NN, which confuses user to think
failover is not behaving correctly. Because *a.b.c.d*  is expected to be the proxy corresponding
to the standby NN here instead.

The reason is that when the ProxyDescriptor data field of RetryInvocationHandler may be shared
by multiple threads that do the RPC calls, the failover done by one thread (which changed
the rpc proxy) may be visible to other threads when the other threads report the above message.

An example sequence: 
# multiple threads start with the same SNN to do RPC calls, 
# all threads discover that a failover is needed, 
# thread X failover first, and changed the ProxyDescriptor's proxyInfo to ANN
# other threads reports the above message with the proxyInfo changed by thread X, and reported
ANN instead of SNN in the message.

Some details:

RetryInvocationHandler does the following when failing over:
  synchronized void failover(long expectedFailoverCount, Method method,
                               int callId) {
      // Make sure that concurrent failed invocations only cause a single
      // actual failover.
      if (failoverCount == expectedFailoverCount) {
      } else {
        LOG.warn("A failover has occurred since the start of call #" + callId
            + " " + proxyInfo.getString(method.getName()));
      proxyInfo = fpp.getProxy();
and changed the proxyInfo in the ProxyDescriptor.

While the log method below report message with ProxyDescriotor's proxyinfo:
private void log(final Method method, final boolean isFailover,
      final int failovers, final long delay, final Exception ex) {
   final StringBuilder b = new StringBuilder()
        .append(ex + ", while invoking ")
    if (failovers > 0) {
      b.append(" after ").append(failovers).append(" failover attempts");
    b.append(isFailover? ". Trying to failover ": ". Retrying ");
    b.append(delay > 0? "after sleeping for " + delay + "ms.": "immediately.");
and so does  {{handleException}} method do
        if (LOG.isDebugEnabled()) {
          LOG.debug("Exception while invoking call #" + callId + " "
              + proxyDescriptor.getProxyInfo().getString(method.getName())
              + ". Not retrying because " + retryInfo.action.reason, e);

   public String getString(String methodName) {
      return proxy.getClass().getSimpleName() + "." + methodName
          + " over " + proxyInfo;

    public String toString() {
      return proxy.getClass().getSimpleName() + " over " + proxyInfo;

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org

View raw message