hadoop-yarn-issues mailing list archives

From "Rohith Sharma K S (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1061) NodeManager is indefinitely waiting for nodeHeartBeat() response from ResouceManager.
Date Fri, 23 Aug 2013 11:31:52 GMT

    [ https://issues.apache.org/jira/browse/YARN-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13748469#comment-13748469 ]

Rohith Sharma K S commented on YARN-1061:

I added all the IPC configurations to the log4j.properties file, but the same issue still recurred.

bq. How can NM wait infinitely? I mean what is your connection timeout set to? 
When I debugged the issue, I found that it lies in the IPC layer. The same problem also occurs in
DataNode to NameNode communication.

When the process is in the T state (for a running process the state is Sl; this can be seen with "ps -p <pid>
-o pid,stat"), i.e. the process has been stopped using "kill -stop <pid>", the IPC proxy does not
throw any timeout exception.
This is because, during proxy creation, the RPC timeout is set to zero (hardcoded) in the RPC.waitForProtocolProxy
method. With the RPC timeout set to zero, the IPC call never throws a timeout exception. Instead, the IPC
client keeps retrying sendPing to the server (RM).
This can be seen in the Client.handleTimeout method:
      private void handleTimeout(SocketTimeoutException e) throws IOException {
        if (shouldCloseConnection.get() || !running.get() || rpcTimeout > 0) {
          throw e;
        } else {

I think the RPC timeout should be taken from the configuration instead of being hardcoded to 0.
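
To make the effect of the hardcoded zero easier to see, here is a minimal, self-contained sketch. It is not the Hadoop code: the class RpcTimeoutSketch, its nested Connection class, the timesOut helper, and the property key "example.rpc.timeout.ms" are all hypothetical names made up for illustration, and java.util.Properties stands in for the real configuration object. The else branch counting a ping is an assumption based on the sendPing retry described above. The stopped server is simulated by treating every read attempt as a socket timeout.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.Properties;

/** Illustrative only: mimics the timeout decision discussed above, outside Hadoop. */
public class RpcTimeoutSketch {

    /** Hypothetical configuration key; not a real Hadoop property name. */
    static final String TIMEOUT_KEY = "example.rpc.timeout.ms";

    /** Reads the RPC timeout from configuration, defaulting to 0 (no timeout). */
    static int rpcTimeoutFrom(Properties conf) {
        return Integer.parseInt(conf.getProperty(TIMEOUT_KEY, "0"));
    }

    /** Hypothetical stand-in for an IPC connection; not the real Hadoop class. */
    static class Connection {
        private final int rpcTimeout; // milliseconds; 0 means "no RPC timeout"
        private int pingsSent = 0;

        Connection(int rpcTimeout) {
            this.rpcTimeout = rpcTimeout;
        }

        /** Rethrows the timeout only when an RPC timeout is configured. */
        void handleTimeout(SocketTimeoutException e) throws IOException {
            if (rpcTimeout > 0) {
                throw e;     // the caller sees the timeout
            } else {
                pingsSent++; // assumed behaviour: ping the server and keep waiting
            }
        }

        int getPingsSent() {
            return pingsSent;
        }
    }

    /**
     * Simulates a server that never answers: every read attempt times out.
     * Returns true if the caller received a timeout exception within maxReads attempts.
     */
    static boolean timesOut(Connection conn, int maxReads) {
        for (int i = 0; i < maxReads; i++) {
            try {
                conn.handleTimeout(new SocketTimeoutException("simulated read timeout"));
            } catch (IOException e) {
                return true;
            }
        }
        return false; // still waiting after maxReads simulated timeouts
    }

    public static void main(String[] args) {
        // No key set: behaves like the hardcoded zero.
        Connection hardcodedZero = new Connection(rpcTimeoutFrom(new Properties()));

        // Key set: the timeout is surfaced to the caller.
        Properties conf = new Properties();
        conf.setProperty(TIMEOUT_KEY, "60000");
        Connection configured = new Connection(rpcTimeoutFrom(conf));

        System.out.println("timeout=0     timed out: " + timesOut(hardcodedZero, 5)
                + ", pings sent: " + hardcodedZero.getPingsSent());
        System.out.println("timeout=60000 timed out: " + timesOut(configured, 5)
                + ", pings sent: " + configured.getPingsSent());
    }
}
```

In this sketch the connection with timeout 0 never reports a timeout and only accumulates pings, while the connection whose timeout comes from the configuration reports the timeout on the first stalled read. That is the behaviour change I am proposing for the NodeManager to ResourceManager heartbeat.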
> NodeManager is indefinitely waiting for nodeHeartBeat() response from ResouceManager.
> -------------------------------------------------------------------------------------
>                 Key: YARN-1061
>                 URL: https://issues.apache.org/jira/browse/YARN-1061
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.0.5-alpha
>            Reporter: Rohith Sharma K S
> It is observed that, in one scenario, the NodeManager waits indefinitely for the nodeHeartbeat
> response from the ResourceManager when the ResourceManager is in a hung state.
> NodeManager should get a timeout exception instead of waiting indefinitely.

