hadoop-common-issues mailing list archives

From "Daryn Sharp (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-13404) RPC call hangs when server side CPU overloaded
Date Fri, 22 Jul 2016 16:41:20 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15389805#comment-15389805 ]

Daryn Sharp commented on HADOOP-13404:

There is definitely a timeout mechanism in the client. Depending on your release, set either
ipc.client.rpc-timeout.ms=<timeout>, or ipc.client.ping=false together with ipc.ping.interval=<timeout>.
If you set ipc.client.ping=true, the ping will just verify that the connection is up, not that
the other end is responsive.
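
As a rough illustration of why a read timeout matters on an established connection, here is a minimal Java sketch using plain sockets. It is not the Hadoop IPC client and does not use the settings above; the class name ReadTimeoutSketch and the method readTimesOut are hypothetical. It assumes a "hung server" can be modeled as a listening socket that completes the TCP handshake but never writes a reply. The connection is up, yet only the read timeout lets the caller notice that the peer is unresponsive.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutSketch {

    /**
     * Connects to a server that accepts the connection but never replies,
     * and reports whether the blocking read gave up after timeoutMs.
     */
    static boolean readTimesOut(int timeoutMs) throws IOException {
        // The listening socket completes the TCP handshake from its backlog,
        // but nothing ever writes a response: a stand-in for a hung server.
        try (ServerSocket silentServer =
                 new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(silentServer.getInetAddress(),
                                        silentServer.getLocalPort())) {
            // A timeout of 0 would mean "block forever" on read.
            client.setSoTimeout(timeoutMs);
            InputStream in = client.getInputStream();
            try {
                in.read(); // waits for a response byte that never arrives
                return false;
            } catch (SocketTimeoutException e) {
                return true; // the caller regains control and can retry or fail over
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("timed out: " + readTimesOut(200));
    }
}
```

With a non-zero timeout the read returns control to the caller, which is the point at which a real client could retry or fail over. With the timeout left at 0, the same read would block for as long as the peer stays silent.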

After a failover, the client will get standby exceptions as the handlers drain the callq.
The standby-to-active transition should not have occurred until the former active went into
standby (which is what sends the standby exceptions). The transition must have been forced,
effectively creating an active/active condition. In this invalid state, yes, clients with no
timeout will hang forever while the NN is hung: "Works as designed". If the hung active were
stopped, clients would fail over.

Please test the timeouts and close as invalid unless there's a bug in the timeouts or standby

bq. Need client side and server side modification, which may have some compatibility issue.
Although not needed here, -100 for an incompatible change.

> RPC call hangs when server side CPU overloaded
> ----------------------------------------------
>                 Key: HADOOP-13404
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13404
>             Project: Hadoop Common
>          Issue Type: Bug
>            Reporter: Peter Shi
> In our reliability test, we inject a fault on the NameNode that consumes 100% of the CPU.
> After the fault injection, all requests on existing connections hang forever and never time
> out. New connections fail over to the other NameNode in the HA deployment.
> There is no timeout mechanism for calls on an established connection.
