From: "Benjamin Bannier (Jira)"
To: issues@mesos.apache.org
Reply-To: dev@mesos.apache.org
Date: Thu, 7 Nov 2019 11:30:00 +0000 (UTC)
Subject: [jira] [Assigned] (MESOS-9940) Framework removal may lead to inconsistent task states between master and agent.

[ https://issues.apache.org/jira/browse/MESOS-9940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Bannier reassigned MESOS-9940:
---------------------------------------

    Assignee:     (was: Benjamin Bannier)

> Framework removal may lead to inconsistent task states between master and agent.
> --------------------------------------------------------------------------------
>
>                 Key: MESOS-9940
>                 URL: https://issues.apache.org/jira/browse/MESOS-9940
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>            Reporter: Meng Zhu
>            Priority: Major
>              Labels: foundations
>
> When a framework is removed from the master (say, due to disconnection), the master sends a `ShutdownFrameworkMessage` to the agent. At the same time, the master transitions the task status to, e.g., KILLED. (https://github.com/apache/mesos/blob/master/src/master/master.cpp#L11247-L11291)
> When the agent receives the shutdown message, it tries to shut down all the executors and destroy all the containers. The tasks' status is updated only after all of this is done. (https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L7914-L7922)
> However, if the executor shutdown gets stuck (e.g. due to a hanging docker daemon), the task status transition never happens, and the master and agent end up with diverged views of these tasks.
> One consequence is that the master may try to schedule more workloads onto the problematic agent (because it believes those tasks' resources have been freed up). Since the agent has no overcommit check, it will comply and launch those tasks. This leads to over-allocation.
> One possible solution is to hold off on the master's status update until the agent is done with the framework shutdown.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
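
The divergence described in the issue can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Mesos code: the `Master` and `Agent` classes, their methods, and the CPU numbers are all hypothetical, and the real logic lives in the `master.cpp` and `slave.cpp` locations linked above. It shows the master freeing a task's resources as soon as the framework is removed, while the agent frees them only once executor shutdown completes.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Hypothetical agent: tracks the resources that are really in use."""
    total_cpus: int
    running: dict = field(default_factory=dict)  # task id -> cpus in use

    def launch(self, task_id, cpus):
        # No overcommit check, mirroring the behaviour the issue describes.
        self.running[task_id] = cpus

    def used_cpus(self):
        return sum(self.running.values())


@dataclass
class Master:
    """Hypothetical master: tracks what it believes is in use on the agent."""
    agent: Agent
    believed: dict = field(default_factory=dict)  # task id -> cpus

    def available_cpus(self):
        return self.agent.total_cpus - sum(self.believed.values())

    def launch(self, task_id, cpus):
        if cpus > self.available_cpus():
            raise RuntimeError("master believes the agent is full")
        self.believed[task_id] = cpus
        self.agent.launch(task_id, cpus)

    def remove_framework(self, task_ids, executor_shutdown_hangs):
        # The master marks the tasks terminal (e.g. KILLED) right away.
        for task_id in task_ids:
            self.believed.pop(task_id, None)
        # The agent updates task state only after executor shutdown finishes.
        if not executor_shutdown_hangs:
            for task_id in task_ids:
                self.agent.running.pop(task_id, None)


def simulate(executor_shutdown_hangs):
    """Return the CPUs in use on a 4-CPU agent after a removal and a relaunch."""
    agent = Agent(total_cpus=4)
    master = Master(agent)
    master.launch("t1", 4)
    master.remove_framework(["t1"], executor_shutdown_hangs)
    master.launch("t2", 4)  # the master believes 4 CPUs are free again
    return agent.used_cpus()
```

In this model, `simulate(False)` leaves the agent at its 4-CPU capacity, while `simulate(True)` leaves it running 8 CPUs' worth of tasks on a 4-CPU agent, which is the over-allocation the issue warns about. Deferring the master-side removal in `remove_framework` until the agent reports completion would correspond to the proposed fix.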