mesos-issues mailing list archives

From "Benjamin Bannier (Jira)" <j...@apache.org>
Subject [jira] [Assigned] (MESOS-9940) Framework removal may lead to inconsistent task states between master and agent.
Date Thu, 07 Nov 2019 11:30:00 GMT

     [ https://issues.apache.org/jira/browse/MESOS-9940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Bannier reassigned MESOS-9940:
---------------------------------------

    Assignee:     (was: Benjamin Bannier)

> Framework removal may lead to inconsistent task states between master and agent.
> --------------------------------------------------------------------------------
>
>                 Key: MESOS-9940
>                 URL: https://issues.apache.org/jira/browse/MESOS-9940
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>            Reporter: Meng Zhu
>            Priority: Major
>              Labels: foundations
>
> When a framework is removed from the master (for example, due to disconnection), the master
> sends a `ShutdownFrameworkMessage` to the agent. At the same time, the master transitions the
> tasks' status to, e.g., KILLED. (https://github.com/apache/mesos/blob/master/src/master/master.cpp#L11247-L11291)
> When the agent receives the shutdown message, it tries to shut down all the executors and
> destroy all the containers. The tasks' status is updated only after all of this is done.
> (https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L7914-L7922)
> However, if the executor shutdown gets stuck (e.g. due to a hanging Docker daemon), the
> task status transition never happens, and the master and agent end up with diverging views
> of these tasks.
> One consequence is that the master may try to schedule more workloads onto the problematic
> agent, because it thinks those tasks' resources have been freed. Since there is no overcommit
> check on the agent, the agent complies and launches those tasks, which leads to over-allocation.
> One possible solution is to hold off on the master's status update until the agent is done
> with the framework shutdown.
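
The divergence described above, and the effect of the proposed fix, can be illustrated with a small toy model. This is only a sketch under stated assumptions: `Agent`, `Master`, `simulate`, and the `defer_until_agent_done` flag are hypothetical names invented for illustration and do not correspond to Mesos classes or APIs. It models a single agent, CPU resources only, and a stuck executor shutdown as "the agent never reports completion".

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Toy agent (hypothetical): tracks CPUs its tasks actually hold."""
    total_cpus: float
    tasks: dict = field(default_factory=dict)  # task_id -> cpus in use

    def used_cpus(self) -> float:
        return sum(self.tasks.values())

    def launch(self, task_id: str, cpus: float) -> None:
        # As in the report: the agent performs no overcommit check.
        self.tasks[task_id] = cpus

    def finish_shutdown(self, task_ids) -> None:
        # Runs only once executors and containers are really gone.
        for task_id in task_ids:
            self.tasks.pop(task_id, None)


@dataclass
class Master:
    """Toy master (hypothetical): tracks what it believes is allocated."""
    agent: Agent
    defer_until_agent_done: bool = False  # the proposed fix, as a switch
    allocated: dict = field(default_factory=dict)  # task_id -> (fw, cpus)
    pending: dict = field(default_factory=dict)    # fw -> task_ids

    def free_cpus(self) -> float:
        used = sum(cpus for _, cpus in self.allocated.values())
        return self.agent.total_cpus - used

    def launch(self, framework: str, task_id: str, cpus: float) -> bool:
        if self.free_cpus() < cpus:
            return False  # master believes the agent is full
        self.allocated[task_id] = (framework, cpus)
        self.agent.launch(task_id, cpus)
        return True

    def remove_framework(self, framework: str) -> list:
        # Stands in for sending the shutdown message to the agent.
        task_ids = [t for t, (fw, _) in self.allocated.items()
                    if fw == framework]
        if self.defer_until_agent_done:
            self.pending[framework] = task_ids  # keep resources booked
        else:
            for task_id in task_ids:            # current behavior
                del self.allocated[task_id]
        return task_ids

    def on_agent_shutdown_done(self, framework: str) -> None:
        # With deferral, resources are released only at this point.
        for task_id in self.pending.pop(framework, []):
            self.allocated.pop(task_id, None)


def simulate(defer: bool, shutdown_stuck: bool):
    """Return (was the second task launched?, CPUs in use on the agent)."""
    agent = Agent(total_cpus=4.0)
    master = Master(agent=agent, defer_until_agent_done=defer)
    master.launch("fw-a", "task-1", 4.0)
    doomed = master.remove_framework("fw-a")
    if not shutdown_stuck:
        agent.finish_shutdown(doomed)
        master.on_agent_shutdown_done("fw-a")
    launched = master.launch("fw-b", "task-2", 4.0)
    return launched, agent.used_cpus()
```

In this model, `simulate(defer=False, shutdown_stuck=True)` lets the second task launch while the first still holds its CPUs, so the 4-CPU agent ends up with 8 CPUs in use. With `defer=True` the master keeps the resources booked and refuses the launch. The trade-off the sketch makes visible is that deferral leaves those resources unavailable for as long as the agent's shutdown stays stuck.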



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
