mesos-issues mailing list archives

From "Meng Zhu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MESOS-8624) Valid tasks may be explicitly dropped by agent due to race conditions
Date Mon, 05 Mar 2018 19:18:00 GMT

    [ https://issues.apache.org/jira/browse/MESOS-8624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386580#comment-16386580 ]

Meng Zhu commented on MESOS-8624:
---------------------------------

Ticket description updated to better clarify the agent behavior before/after MESOS-1720.

> Valid tasks may be explicitly dropped by agent due to race conditions
> ---------------------------------------------------------------------
>
>                 Key: MESOS-8624
>                 URL: https://issues.apache.org/jira/browse/MESOS-8624
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.5.0
>            Reporter: Meng Zhu
>            Assignee: Meng Zhu
>            Priority: Critical
>
> Tasks may be explicitly dropped by the agent if all of the following conditions are met:
> (1) Several `LAUNCH_TASK` or `LAUNCH_GROUP` calls use the same executor.
> (2) The executor does not currently exist on the agent.
> (3) Due to race conditions, these tasks try to launch on the agent in an order different
> from their original launch order (see below for how this can happen).
> In this case, tasks that try to launch on the agent before the first task in the
> original order will be explicitly dropped by the agent (`TASK_DROPPED` or `TASK_LOST` will
> be sent).
> Mesos does not currently guarantee in-order task launch on the agent. Suppose the Mesos
> master sends two `launchTask` messages (launching task1 and task2) to an agent. In most cases
> (except MESOS-3870), these messages are delivered to the agent in order. However, the agent's
> task launch path currently contains two asynchronous steps (unschedule GC and task
> authorization). Depending on CPU scheduling order, the task2 launch may finish these two steps
> earlier than task1 and reach the launch executor stage before task1.
> Prior to MESOS-1720, both tasks would still be launched in this case. If task1
> and task2 use the same executor, whichever reaches the launch executor stage first launches
> the executor.
> However, since MESOS-1720 was resolved, agents enforce an order for tasks that use
> the same executor. Specifically, when the master crafts the launch task message, it sets
> the `launch_executor` flag. In the case above, task1 has the `launch_executor` flag
> set to true, and task2 (and any subsequent task that uses the same executor) has the
> flag set to false.
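
As an illustration of the flag assignment described above, here is a minimal Python model. It is a sketch, not the master's actual code: `assign_launch_executor`, its arguments, and the executor IDs are hypothetical names, and it assumes the only rule is "the first task sent for an executor that is not yet known gets `launch_executor` set to true".

```python
def assign_launch_executor(launches, known_executors=()):
    """Model of how the master might flag each launch message.

    `launches` is a list of (task_name, executor_id) pairs in the order the
    master sends them. Returns (task_name, executor_id, launch_executor)
    tuples in the same order.
    """
    known = set(known_executors)
    flagged = []
    for task_name, executor_id in launches:
        # Only the first task for a not-yet-known executor may launch it.
        launch_executor = executor_id not in known
        known.add(executor_id)
        flagged.append((task_name, executor_id, launch_executor))
    return flagged


flags = assign_launch_executor(
    [("task1", "exec-A"), ("task2", "exec-A"), ("task3", "exec-B")]
)
```

In this model task1 and task3 carry the flag set to true, while task2, which reuses `exec-A`, carries it set to false.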
> If task2 reaches the launch executor stage before task1 (due to the race condition described
> above), the agent sees that its `launch_executor` flag is false but the executor specified
> in the `launchTask` message is not running. As a result, it explicitly drops task2, as
> in:
> https://github.com/apache/mesos/blob/32f6d4eec2724414e217875f4f7d3b2538db5381/src/slave/slave.cpp#L2888
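
The decision described above can be modelled in a few lines of Python. This is a simplified sketch and not the code at the linked line of `slave.cpp`: the `Task` and `Agent` classes, the method name, and the `"LAUNCHED"`/`"DROPPED"` outcome strings are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    executor_id: str
    launch_executor: bool  # set by the master, true only for the first task


@dataclass
class Agent:
    executors: set = field(default_factory=set)

    def reach_launch_executor_stage(self, task):
        """Outcome for a task arriving at the launch executor stage."""
        if task.executor_id in self.executors:
            return "LAUNCHED"  # executor already exists, hand the task to it
        if task.launch_executor:
            self.executors.add(task.executor_id)
            return "LAUNCHED"  # this task is allowed to create the executor
        # Executor is missing and this task may not launch it.
        return "DROPPED"


def run(arrival_order):
    agent = Agent()
    return {t.name: agent.reach_launch_executor_stage(t) for t in arrival_order}


task1 = Task("task1", "exec-A", launch_executor=True)
task2 = Task("task2", "exec-A", launch_executor=False)

in_order = run([task1, task2])   # both tasks are launched
reordered = run([task2, task1])  # task2 arrives first and is dropped
```

Only the arrival order differs between the two runs, which is why a valid task2 can be dropped purely because of scheduling.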
> Based on discussion with [~chhsia0] and [~bmahler], we should take the explicit approach
> of using `process::Sequence` to ensure ordered task delivery (on both the master and agent).
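
To show the idea behind sequencing the asynchronous steps, here is a rough Python `asyncio` analogue. It is not libprocess and does not reproduce the `process::Sequence` API: the `Sequence` class below, `launch_path`, `simulate`, and the delays are all hypothetical, and the sketch assumes that each added callback simply starts once the previously added one has finished.

```python
import asyncio


class Sequence:
    """Runs added callbacks one at a time, in the order they were added."""

    def __init__(self):
        self._last = None  # asyncio task of the most recently added callback

    def add(self, callback):
        previous = self._last

        async def run_after_previous():
            if previous is not None:
                try:
                    await previous  # wait for the earlier callback to finish
                except Exception:
                    pass  # an earlier failure should not block later callbacks
            return await callback()

        self._last = asyncio.create_task(run_after_previous())
        return self._last


async def launch_path(name, delay, arrivals):
    await asyncio.sleep(delay)  # stands in for unschedule GC + authorization
    arrivals.append(name)       # the task reaches the launch executor stage


async def simulate(use_sequence):
    arrivals = []
    # task1 is sent first, but its asynchronous steps take longer.
    steps = [("task1", 0.02), ("task2", 0.0)]
    if use_sequence:
        seq = Sequence()
        pending = [
            seq.add(lambda n=n, d=d: launch_path(n, d, arrivals))
            for n, d in steps
        ]
    else:
        pending = [
            asyncio.create_task(launch_path(n, d, arrivals)) for n, d in steps
        ]
    await asyncio.gather(*pending)
    return arrivals


unordered = asyncio.run(simulate(use_sequence=False))
ordered = asyncio.run(simulate(use_sequence=True))
```

Without the sequence, task2 reaches the launch executor stage first. With it, the arrival order matches the send order, at the cost of serializing the asynchronous steps of the sequenced tasks.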



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
