mesos-issues mailing list archives

From "Vinod Kone (JIRA)" <>
Subject [jira] [Updated] (MESOS-8411) Killing a queued task can lead to the command executor never terminating.
Date Tue, 09 Jan 2018 00:12:00 GMT


Vinod Kone updated MESOS-8411:
    Priority: Critical  (was: Major)

> Killing a queued task can lead to the command executor never terminating.
> -------------------------------------------------------------------------
>                 Key: MESOS-8411
>                 URL:
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent
>            Reporter: Benjamin Mahler
>            Assignee: Meng Zhu
>            Priority: Critical
> If a task is killed while the executor is re-registering, we will remove it from queued
tasks and shut down the executor if all of its initial tasks could not be delivered. However,
there is a case (within {{Slave::___run}}) where we leave the executor running. The race is:
> # Command-executor task launched.
> # Command executor sends registration message. Agent tells containerizer to update the
resources before it sends the tasks to the executor.
> # Kill arrives, and we synchronously remove the task from queued tasks.
> # Containerizer finishes updating the resources, and in {{Slave::___run}} the killed
task is ignored.
> # Command executor stays running!
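
The five steps above can be replayed in a small toy model. This is only a sketch of the ordering problem, not agent code: {{ToyExecutor}}, {{killQueuedTask}}, {{onResourcesUpdated}} and {{replayRace}} are hypothetical names invented for illustration, and the containerizer update is reduced to a plain function call that runs after the kill.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for the agent's per-executor bookkeeping (hypothetical type).
struct ToyExecutor {
  std::vector<std::string> queuedTasks;    // accepted by the agent, not yet delivered
  std::vector<std::string> launchedTasks;  // delivered to the executor
  bool running = true;
};

// Step 3: the kill synchronously removes the task from the queue.
void killQueuedTask(ToyExecutor& executor, const std::string& taskId) {
  auto& queued = executor.queuedTasks;
  queued.erase(std::remove(queued.begin(), queued.end(), taskId), queued.end());
}

// Step 4: continuation that runs once the resource update has finished.
// Tasks that are no longer queued are skipped.
void onResourcesUpdated(ToyExecutor& executor,
                        const std::vector<std::string>& pending) {
  for (const std::string& taskId : pending) {
    auto& queued = executor.queuedTasks;
    auto it = std::find(queued.begin(), queued.end(), taskId);
    if (it == queued.end()) {
      continue;  // killed in the meantime: ignored
    }
    queued.erase(it);
    executor.launchedTasks.push_back(taskId);
  }
  // Nothing here shuts the executor down when no task was delivered.
}

// Replays steps 1-5; returns true if the executor is left running with no tasks.
bool replayRace() {
  ToyExecutor executor;
  executor.queuedTasks.push_back("task-1");                  // step 1
  std::vector<std::string> pending = executor.queuedTasks;   // step 2: update starts
  assert(pending.size() == 1);
  killQueuedTask(executor, "task-1");                        // step 3
  onResourcesUpdated(executor, pending);                     // step 4
  return executor.running && executor.queuedTasks.empty() &&
         executor.launchedTasks.empty();                     // step 5
}
```

In this model {{replayRace()}} returns true: the executor is still marked running although it has no queued or delivered task.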
> Executors could have a timeout to handle this case, but it's not clear that all executors
will implement this correctly. It would be better to have a defensive policy that will shut
down an executor if all of its initial batch of tasks were killed prior to delivery.
> In order to implement this, one approach discussed with [~vinodkone] is to look at the
running + terminated but unacked + completed tasks, and if empty, shut the executor down in
the {{Slave::___run}} path. This will require us to check that the completed task cache size
is set to at least 1, and this also assumes that the completed tasks are not cleared based
on time or during agent recovery.
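
The check described above could look roughly like the following. This is a hedged sketch under stated assumptions, not the patch that was committed: {{ExecutorTasks}}, {{shouldShutDownExecutor}} and {{completedCacheSize}} are hypothetical names, and declining to shut down when the completed task cache is smaller than 1 is an assumption chosen here as the conservative fallback.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical view of the task sets named in the description.
struct ExecutorTasks {
  std::vector<std::string> running;
  std::vector<std::string> terminatedUnacked;  // terminated, update not yet acked
  std::vector<std::string> completed;          // bounded completed-task cache
};

// Returns true if the executor has never had a task delivered and should be
// shut down. With a cache size of 0 the completed list is always empty, so
// the check cannot tell "no task ever ran" from "history discarded"; this
// sketch assumes the safe answer in that case is to leave the executor alone.
bool shouldShutDownExecutor(const ExecutorTasks& tasks,
                            std::size_t completedCacheSize) {
  if (completedCacheSize < 1) {
    return false;
  }
  return tasks.running.empty() &&
         tasks.terminatedUnacked.empty() &&
         tasks.completed.empty();
}

// Small usage example.
void checkExamples() {
  ExecutorTasks none;
  assert(shouldShutDownExecutor(none, 1));   // nothing was ever delivered

  ExecutorTasks finished;
  finished.completed.push_back("task-1");
  assert(!shouldShutDownExecutor(finished, 1));  // a task ran and completed
}
```

As the description notes, the sketch is only sound if completed tasks are not also cleared based on time or during agent recovery.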

This message was sent by Atlassian JIRA
