apex-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (APEXCORE-743) Killed container is shown as running
Date Fri, 09 Jun 2017 18:23:21 GMT

    [ https://issues.apache.org/jira/browse/APEXCORE-743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16044808#comment-16044808 ]

ASF GitHub Bot commented on APEXCORE-743:
-----------------------------------------

GitHub user sandeshh opened a pull request:

    https://github.com/apache/apex-core/pull/543

    APEXCORE-743 Added timeout for the Container kill request sent to NM.

    @PramodSSImmaneni @vrozov please review.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sandeshh/apex-core APEXCORE-743

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/apex-core/pull/543.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #543
    
----
commit 501dfa47517f94aa35d60c4e22ec825e2c99fa27
Author: Sandesh Hegde <sandesh.hegde@gmail.com>
Date:   2017-06-01T23:28:56Z

    APEXCORE-743 Added timeout for the Container kill request sent to NM.

----


> Killed container is shown as running
> ------------------------------------
>
>                 Key: APEXCORE-743
>                 URL: https://issues.apache.org/jira/browse/APEXCORE-743
>             Project: Apache Apex Core
>          Issue Type: Bug
>            Reporter: Sandesh
>
> Here is the behavior:
> 1. Container heartbeat timeout happens.
> 2. AppMaster sends the request to kill the container.
> 3. Container is killed.
> 4. AppMaster state is not updated and no new container is allocated.
> After analyzing the code, here is the possible reason:
> 1. The kill request is sent to the NM.
> 2. The container is killed by the NM, but the NM callback doesn't happen. RecoverContainer
> is called in the NM callback, so in this case it is never called.
> 3. AppMaster state is not updated.
> Possible fix:
> Have a timeout for the NM callback, so that if the NM doesn't respond in time that the
> container is killed, RecoverContainer is called anyway.
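
The possible fix described above could be sketched roughly as follows. This is
only an illustration of the timeout idea, not the code in the pull request: the
class KillTimeoutTracker, its method names, and the Consumer standing in for
RecoverContainer are all hypothetical, and the real AppMaster wiring is not shown.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical helper: runs recovery on NM callback or on timeout, whichever comes first. */
class KillTimeoutTracker {
    // Daemon timer thread so a forgotten shutdown() does not keep the JVM alive.
    private final ScheduledExecutorService timer =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "kill-timeout");
            t.setDaemon(true);
            return t;
        });

    // Containers with a kill request sent but recovery not yet triggered.
    private final Set<String> pendingKills = ConcurrentHashMap.newKeySet();

    // Stand-in for the RecoverContainer step mentioned in the issue (assumption).
    private final Consumer<String> recoverContainer;
    private final long timeoutMillis;

    KillTimeoutTracker(Consumer<String> recoverContainer, long timeoutMillis) {
        this.recoverContainer = recoverContainer;
        this.timeoutMillis = timeoutMillis;
    }

    /** Call right after the kill request has been sent to the NM. */
    void killRequested(String containerId) {
        if (pendingKills.add(containerId)) {
            timer.schedule(() -> recoverOnce(containerId),
                           timeoutMillis, TimeUnit.MILLISECONDS);
        }
    }

    /** Call from the NM callback that reports the container as stopped. */
    void killConfirmed(String containerId) {
        recoverOnce(containerId);
    }

    // remove() returns true for exactly one caller per pending kill, so the
    // callback and the timeout can race without recovering the container twice.
    private void recoverOnce(String containerId) {
        if (pendingKills.remove(containerId)) {
            recoverContainer.accept(containerId);
        }
    }

    void shutdown() {
        timer.shutdownNow();
    }
}
```

With this shape, a callback that arrives late (after the timeout already fired)
is simply ignored, because the container is no longer in the pending set.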



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)