Date: Fri, 14 Jul 2017 23:14:01 +0000 (UTC)
From: "ASF GitHub Bot (JIRA)"
To: dev@apex.incubator.apache.org
Reply-To: dev@apex.apache.org
Subject: [jira] [Commented] (APEXCORE-743) Killed container is shown as running

    [ https://issues.apache.org/jira/browse/APEXCORE-743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16088265#comment-16088265 ]

ASF GitHub Bot commented on APEXCORE-743:
-----------------------------------------

PramodSSImmaneni commented on a change in pull request #543: APEXCORE-743 Added timeout for the Container kill request sent to NM.
URL: https://github.com/apache/apex-core/pull/543#discussion_r127565697

 ##########
 File path: engine/src/main/java/com/datatorrent/stram/StreamingAppMasterService.java
 ##########
 @@ -138,6 +139,7 @@
     * This should be replaced when a constant is defined there
     */
    private static final String SSL_SERVER_KEYSTORE_LOCATION = "ssl.server.keystore.location";
 +  private static final int NODE_MANAGER_KILL_CONTAINER_TIMEOUT = 30 * 1000;

 Review comment:
   Can you make it configurable via a system property? See bufferserver.server.Server.BACK_PRESSURE_ENABLED

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

> Killed container is shown as running
> ------------------------------------
>
>                 Key: APEXCORE-743
>                 URL: https://issues.apache.org/jira/browse/APEXCORE-743
>             Project: Apache Apex Core
>          Issue Type: Bug
>            Reporter: Sandesh
>            Assignee: Sandesh
>
> Here is the behavior:
> 1. Container heartbeat timeout happens.
> 2. AppMaster sends the request to kill the container.
> 3. Container is killed.
> 4. AppMaster state is not updated and no new container is allocated.
> After analyzing the code, here is the possible reason:
> 1. The kill request is sent to the NM.
> 2. The container is killed by the NM, but the NM callback doesn't happen. RecoverContainer is called in the NM callback, so in this case it is never called.
> 3. AppMaster state is not updated.
> Possible fix:
> Add a timeout for the NM callback so that, if the NM doesn't confirm in time that the container was killed, RecoverContainer is called anyway.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
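
The fix proposed in the issue, together with the review request to make the timeout configurable through a system property, could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from pull request #543: the class KillTimeoutSketch, the property name example.nm.killContainerTimeoutMillis, and the methods killRequested and containerStopped are all hypothetical, and the recovery action is passed in as a plain callback rather than the real RecoverContainer logic. Only the 30 * 1000 default comes from the diff quoted above.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical sketch of a kill-request timeout; not the actual Apex code. */
final class KillTimeoutSketch {
  // Hypothetical property name, chosen for illustration only.
  static final String TIMEOUT_PROPERTY = "example.nm.killContainerTimeoutMillis";
  // Default taken from the constant in the quoted diff (30 seconds).
  static final int DEFAULT_TIMEOUT_MILLIS = 30 * 1000;

  /** Reads the timeout from a system property, falling back to the default. */
  static int killTimeoutMillis() {
    return Integer.getInteger(TIMEOUT_PROPERTY, DEFAULT_TIMEOUT_MILLIS);
  }

  // Daemon timer thread so the sketch never keeps the JVM alive.
  private final ScheduledExecutorService timer =
      Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "kill-timeout");
        t.setDaemon(true);
        return t;
      });
  // Containers with a kill request sent but no recovery performed yet.
  private final Set<String> pendingKills = ConcurrentHashMap.newKeySet();
  // Stand-in for the recovery step (assumption: takes a container id).
  private final Consumer<String> recoverContainer;

  KillTimeoutSketch(Consumer<String> recoverContainer) {
    this.recoverContainer = recoverContainer;
  }

  /** Call right after the kill request has been sent to the NM. */
  void killRequested(String containerId) {
    pendingKills.add(containerId);
    timer.schedule(() -> recoverOnce(containerId),
        killTimeoutMillis(), TimeUnit.MILLISECONDS);
  }

  /** Call from the NM callback that reports the container as stopped. */
  void containerStopped(String containerId) {
    recoverOnce(containerId);
  }

  // Whichever path removes the id first (callback or timer) runs recovery,
  // so recovery happens at most once per kill request.
  private void recoverOnce(String containerId) {
    if (pendingKills.remove(containerId)) {
      recoverContainer.accept(containerId);
    }
  }

  void shutdown() {
    timer.shutdownNow();
  }
}
```

With this shape, the timeout could be overridden at launch with something like -Dexample.nm.killContainerTimeoutMillis=60000. Routing both the callback and the timer through the same remove-then-recover step is one way to avoid recovering a container twice when the NM callback arrives late.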