apex-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (APEXCORE-743) Killed container is shown as running
Date Fri, 14 Jul 2017 23:14:01 GMT

    [ https://issues.apache.org/jira/browse/APEXCORE-743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16088265#comment-16088265 ]

ASF GitHub Bot commented on APEXCORE-743:

PramodSSImmaneni commented on a change in pull request #543: APEXCORE-743 Added timeout for
the Container kill request sent to NM.
URL: https://github.com/apache/apex-core/pull/543#discussion_r127565697

 File path: engine/src/main/java/com/datatorrent/stram/StreamingAppMasterService.java
 @@ -138,6 +139,7 @@
    * This should be replaced when a constant is defined there
   private static final String SSL_SERVER_KEYSTORE_LOCATION = "ssl.server.keystore.location";
+  private static final int NODE_MANAGER_KILL_CONTAINER_TIMEOUT = 30 * 1000;
 Review comment:
   Can you make it configurable via a system property? See bufferserver.server.Server.BACK_PRESSURE_ENABLED.
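
As an illustration of the review suggestion, here is a minimal sketch of reading the timeout from a JVM system property with a fallback to the 30 second default from the diff. The class name `KillTimeoutConfig` and the property name are hypothetical and are not taken from the pull request or from the Apex code base; only `Integer.getInteger` is a standard JDK call.

```java
// Hypothetical helper, not part of apex-core: shows one way to make the
// kill-container timeout overridable with -D<property>=<millis>.
final class KillTimeoutConfig {
  // Assumed property name, chosen for this example only.
  static final String KILL_TIMEOUT_PROPERTY = "example.stram.nmKillContainerTimeoutMillis";
  // Same default as the constant added in the diff (30 seconds).
  static final int DEFAULT_KILL_TIMEOUT_MILLIS = 30 * 1000;

  private KillTimeoutConfig() {
  }

  // Returns the configured timeout, or the default when the property is
  // unset, not a valid integer, or not positive.
  static int killTimeoutMillis() {
    int value = Integer.getInteger(KILL_TIMEOUT_PROPERTY, DEFAULT_KILL_TIMEOUT_MILLIS);
    return value > 0 ? value : DEFAULT_KILL_TIMEOUT_MILLIS;
  }
}
```

With this sketch, starting the JVM with `-Dexample.stram.nmKillContainerTimeoutMillis=5000` would yield a 5 second timeout, and omitting the flag keeps the default.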

> Killed container is shown as running
> ------------------------------------
>                 Key: APEXCORE-743
>                 URL: https://issues.apache.org/jira/browse/APEXCORE-743
>             Project: Apache Apex Core
>          Issue Type: Bug
>            Reporter: Sandesh
>            Assignee: Sandesh
> Here is the behavior:
> 1. Container heartbeat timeout happens.
> 2. AppMaster sends the request to kill the container.
> 3. Container is killed.
> 4. AppMaster state is not updated and no new container is allocated.
> After analyzing the code, here is the possible reason:
> 1. The kill request is sent to the NM.
> 2. The container is killed by the NM, but the NM callback doesn't happen. RecoverContainer is called in the NM callback, so in this case it is never called.
> 3. AppMaster state is not updated.
> Possible fix:
> Have a timeout for the NM callback, so that if the NM doesn't confirm in time that the container was killed, RecoverContainer is called anyway.
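
The proposed fix can be sketched roughly as follows. This is not the code from the pull request: `KillTimeoutWatchdog`, `killRequested`, `containerStopped`, and the `recoverContainer` consumer are hypothetical names standing in for the AppMaster's kill path, the NM callback, and RecoverContainer. The idea is that recovery runs exactly once per container, triggered by whichever comes first, the NM callback or the timeout.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical sketch: runs the recovery action when the NM callback
// arrives, or after a timeout if the callback never arrives.
final class KillTimeoutWatchdog {
  private final ScheduledExecutorService scheduler;
  private final long timeoutMillis;
  private final Consumer<String> recoverContainer;
  // Containers with a kill request in flight and no recovery done yet.
  private final Set<String> pending = ConcurrentHashMap.newKeySet();

  KillTimeoutWatchdog(ScheduledExecutorService scheduler, long timeoutMillis,
      Consumer<String> recoverContainer) {
    this.scheduler = scheduler;
    this.timeoutMillis = timeoutMillis;
    this.recoverContainer = recoverContainer;
  }

  // Call right after the kill request has been sent to the NM.
  void killRequested(String containerId) {
    if (!pending.add(containerId)) {
      return; // already tracking this container
    }
    scheduler.schedule(() -> recoverOnce(containerId), timeoutMillis, TimeUnit.MILLISECONDS);
  }

  // Call from the NM callback that reports the container as stopped.
  void containerStopped(String containerId) {
    recoverOnce(containerId);
  }

  // Only the caller that removes the id from the set runs the recovery,
  // so the callback and the timer cannot both trigger it.
  private void recoverOnce(String containerId) {
    if (pending.remove(containerId)) {
      recoverContainer.accept(containerId);
    }
  }
}
```

The sketch assumes a container id is killed at most once; if the same id could be re-killed before an earlier timer fires, the timer would need to be cancelled or tagged per request.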

