mesos-issues mailing list archives

From "Neil Conway (JIRA)" <j...@apache.org>
Subject [jira] [Created] (MESOS-6608) Do not transition tasks to TASK_KILLED on framework teardown
Date Fri, 18 Nov 2016 20:30:58 GMT
Neil Conway created MESOS-6608:
----------------------------------

             Summary: Do not transition tasks to TASK_KILLED on framework teardown
                 Key: MESOS-6608
                 URL: https://issues.apache.org/jira/browse/MESOS-6608
             Project: Mesos
          Issue Type: Bug
          Components: master
            Reporter: Neil Conway


When a framework is torn down or disconnects, we currently transition the framework's tasks
to state TASK_KILLED at the master. See

* https://reviews.apache.org/r/25250
* MESOS-1736

The state change is applied at the master only; concurrently, the master sends a {{ShutdownFrameworkMessage}}
to each agent that is running one of the framework's tasks.

Marking the task KILLED in this manner is problematic for two reasons:

# The task is still running and may continue running for an unbounded length of time if the
agent becomes partitioned.
# KILLED is usually used to denote tasks that are killed in response to a "kill task" operation.

My primary concern here is #1. We could pick a different terminal state to address #2 but
I think that is secondary: transitioning the task to _any_ terminal state before it has been
terminated is problematic, in my view.

Proposed behavior: when the framework teardown is applied, we keep the task in its current
state at the master. Then when the agent receives the {{ShutdownFrameworkMessage}}, it can
shut down the task and eventually respond with a terminal status update. At that point we can
transition the task into the appropriate terminal state (whether it be KILLED, FAILED, GONE,
or a new state).
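
To make the proposed ordering concrete, here is a minimal toy model in Python. It is only a sketch: {{ToyMaster}}, {{Task}}, {{outbox}} and the method names are hypothetical and do not correspond to the real master code. The only point it illustrates is that teardown sends the shutdown message without touching task state, and the state changes only when a terminal status update arrives.

```python
from dataclasses import dataclass
from enum import Enum


class TaskState(Enum):
    TASK_RUNNING = "TASK_RUNNING"
    TASK_KILLED = "TASK_KILLED"
    TASK_FAILED = "TASK_FAILED"


@dataclass
class Task:
    framework_id: str
    agent_id: str
    state: TaskState = TaskState.TASK_RUNNING


class ToyMaster:
    """Hypothetical stand-in for the master's task bookkeeping."""

    def __init__(self):
        self.tasks = {}   # task_id -> Task
        self.outbox = []  # (agent_id, framework_id) shutdown messages "sent"

    def add_task(self, task_id, framework_id, agent_id):
        self.tasks[task_id] = Task(framework_id, agent_id)

    def teardown_framework(self, framework_id):
        # Proposed behavior: notify every agent running one of the
        # framework's tasks, but leave the recorded task state unchanged.
        agents = sorted(
            {t.agent_id for t in self.tasks.values()
             if t.framework_id == framework_id}
        )
        for agent_id in agents:
            self.outbox.append((agent_id, framework_id))

    def status_update(self, task_id, state):
        # Only an update reported by the agent moves the task to a
        # terminal state.
        self.tasks[task_id].state = state


master = ToyMaster()
master.add_task("t1", "fw1", "agent1")
master.teardown_framework("fw1")
print(master.tasks["t1"].state)   # still TASK_RUNNING after teardown
master.status_update("t1", TaskState.TASK_KILLED)
print(master.tasks["t1"].state)   # terminal only after the agent reports
```

If the agent is partitioned, the status update never arrives in this model and the task stays in its last known state, which matches what is actually happening on the agent.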

This will probably require some changes to the status update machinery, since we currently
drop status updates for terminating frameworks at the agent. Since the scheduler is gone,
we'd need to have the master ack the status update rather than the framework.
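
A rough sketch of the acknowledgement change, again as a toy Python model rather than the actual status update machinery: {{ToyAgent}}, {{master_handle_update}} and the dictionary-shaped update are all hypothetical names introduced for illustration. The assumption is that the agent retains an update until it is acknowledged, and that the master acknowledges it directly when there is no scheduler left to do so.

```python
class ToyAgent:
    """Hypothetical agent that retains updates until they are acknowledged."""

    def __init__(self):
        self.pending = {}  # update_id -> (task_id, state)

    def send_update(self, update_id, task_id, state):
        self.pending[update_id] = (task_id, state)
        return {"update_id": update_id, "task_id": task_id, "state": state}

    def acknowledge(self, update_id):
        self.pending.pop(update_id, None)


def master_handle_update(update, framework_connected, agent, forward):
    """Return which party is responsible for acknowledging the update."""
    if framework_connected:
        # Normal path: the scheduler receives the update and acks it later.
        forward(update)
        return "framework"
    # Framework has been torn down: the master acks on its behalf so the
    # agent does not keep retrying the update indefinitely.
    agent.acknowledge(update["update_id"])
    return "master"


agent = ToyAgent()
update = agent.send_update("u1", "t1", "TASK_KILLED")
acker = master_handle_update(update, False, agent, forward=lambda u: None)
print(acker, agent.pending)
```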



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
