Date: Fri, 9 Dec 2016 10:17:59 +0000 (UTC)
From: "Benjamin Bannier (JIRA)"
To: issues@mesos.apache.org
Reply-To: dev@mesos.apache.org
Subject: [jira] [Updated] (MESOS-6743) Docker executor hangs forever if `docker stop` fails.

     [ https://issues.apache.org/jira/browse/MESOS-6743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Bannier updated MESOS-6743:
------------------------------------
    Description:
If {{docker stop}} finishes with an error status, the executor should catch this and react instead of indefinitely waiting for {{reaped}} to return.

An interesting question is _how_ to react. Here are possible solutions.

1. Retry {{docker stop}}. In this case it is unclear how many times to retry and what to do if {{docker stop}} continues to fail.

2. Unmark task as {{killed}}. This will allow frameworks to retry the kill. However, in this case it is unclear what status updates we should send: {{TASK_KILLING}} for every kill retry? an extra update when we failed to kill a task? or set a specific reason in {{TASK_KILLING}}?

3. Clean up and exit. In this case we should make sure the task container is killed or notify the framework and the operator that the container may still be running.

  was:
If {{docker stop}} finishes with an error status, the executor should catch this and react instead of indefinitely waiting for {{reaped}} to return.

An interesting question is _how_ to react. Here are possible solutions.

1. Retry {{docker stop}}. In this case it is unclear how many times to retry and what to do if {{docker stop}} continues to fail.

2. Unmark task as {{killed}}. This will allow frameworks to retry the kill. However, in this case it is unclear what status updates we should send: {TASK_KILLING}} for every kill retry? an extra update when we failed to kill a task? or set a specific reason in {{TASK_KILLING}}?

3. Clean up and exit. In this case we should make sure the task container is killed or notify the framework and the operator that the container may still be running.

> Docker executor hangs forever if `docker stop` fails.
> -----------------------------------------------------
>
>                 Key: MESOS-6743
>                 URL: https://issues.apache.org/jira/browse/MESOS-6743
>             Project: Mesos
>          Issue Type: Bug
>          Components: docker
>    Affects Versions: 1.0.1, 1.1.0
>            Reporter: Alexander Rukletsov
>              Labels: mesosphere
>
> If {{docker stop}} finishes with an error status, the executor should catch this and react instead of indefinitely waiting for {{reaped}} to return.
> An interesting question is _how_ to react. Here are possible solutions.
> 1. Retry {{docker stop}}. In this case it is unclear how many times to retry and what to do if {{docker stop}} continues to fail.
> 2. Unmark task as {{killed}}. This will allow frameworks to retry the kill. However, in this case it is unclear what status updates we should send: {{TASK_KILLING}} for every kill retry? an extra update when we failed to kill a task? or set a specific reason in {{TASK_KILLING}}?
> 3. Clean up and exit. In this case we should make sure the task container is killed or notify the framework and the operator that the container may still be running.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
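
For illustration only, here is one way options 1 and 3 from the description could be combined: retry {{docker stop}} a bounded number of times, and if it keeps failing, give up and report that the container may still be running instead of waiting forever. This is a rough Python sketch, not the Mesos executor code (which is C++). The names StopOutcome, docker_stop and stop_with_retries, the retry count, the backoff and the extra 30-second CLI timeout are all assumptions made for the example.

```python
import subprocess
import time
from enum import Enum
from typing import Callable


class StopOutcome(Enum):
    """Hypothetical result labels; not Mesos task states."""
    STOPPED = "stopped"
    GAVE_UP = "container may still be running"


def docker_stop(container: str, grace_seconds: int = 10) -> int:
    """Run `docker stop` once and return its exit status (0 means success)."""
    try:
        result = subprocess.run(
            ["docker", "stop", "-t", str(grace_seconds), container],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            # Assumption: bound the wait so a hung CLI cannot block us forever.
            timeout=grace_seconds + 30,
        )
        return result.returncode
    except (subprocess.TimeoutExpired, OSError):
        # Treat a hung or missing docker binary like a failed stop.
        return 1


def stop_with_retries(
    stop: Callable[[], int],
    max_attempts: int = 3,
    backoff_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> StopOutcome:
    """Call `stop` until it returns 0 or `max_attempts` is reached."""
    for attempt in range(1, max_attempts + 1):
        if stop() == 0:
            return StopOutcome.STOPPED
        if attempt < max_attempts:
            # Linear backoff between attempts; the policy is an arbitrary choice.
            sleep(backoff_seconds * attempt)
    # Option 3: stop retrying; the caller should clean up and notify the
    # framework and the operator that the container may still be running.
    return StopOutcome.GAVE_UP
```

A caller would use something like stop_with_retries(lambda: docker_stop("my-container")), where "my-container" is a placeholder name. How a GAVE_UP result should map onto status updates is exactly the open question raised in option 2, and the sketch does not try to answer it.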