mesos-issues mailing list archives

From "Vinod Kone (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MESOS-8125) Agent should properly handle recovering an executor when its pid is reused
Date Mon, 08 Jan 2018 20:23:00 GMT

     [ https://issues.apache.org/jira/browse/MESOS-8125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kone updated MESOS-8125:
------------------------------
    Target Version/s: 1.6.0

> Agent should properly handle recovering an executor when its pid is reused
> --------------------------------------------------------------------------
>
>                 Key: MESOS-8125
>                 URL: https://issues.apache.org/jira/browse/MESOS-8125
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Gastón Kleiman
>            Priority: Critical
>
> We know that all executors will be gone once the host on which an agent is running is rebooted, so there's no need to try to recover these executors.
> Trying to recover stopped executors can lead to problems if another process is assigned the same pid that the executor had before the reboot. In this case the agent will unsuccessfully try to reregister with the executor, and then transition it to a {{TERMINATING}} state. The executor will sadly get stuck in that state, and the tasks that it started will get stuck in whatever state they were in at the time of the reboot.
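
A minimal sketch of the idea above, not the agent's actual recovery code: if the host has rebooted since the last checkpoint, no executor can still be alive, so none should be recovered. The function names are hypothetical. The sketch assumes a boot ID was saved to a file before the reboot and that the current one can be read from {{/proc/sys/kernel/random/boot_id}} (Linux only); neither detail comes from this issue.

```python
from pathlib import Path

# Assumption: Linux exposes a per-boot UUID at this path.
LINUX_BOOT_ID = Path("/proc/sys/kernel/random/boot_id")


def host_rebooted(checkpointed_boot_id_file, current_boot_id_file=LINUX_BOOT_ID):
    """Return True if the checkpointed boot ID differs from the current one."""
    checkpointed = Path(checkpointed_boot_id_file)
    if not checkpointed.exists():
        # Nothing was checkpointed, so no reboot can be inferred.
        return False
    saved = checkpointed.read_text().strip()
    current = Path(current_boot_id_file).read_text().strip()
    return saved != current


def executors_to_recover(executor_ids, rebooted):
    """After a reboot every executor is gone, so recover none of them."""
    return [] if rebooted else list(executor_ids)
```

Skipping recovery this way never consults the checkpointed pid after a reboot, so a reused pid cannot be mistaken for the old executor.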
> One way of getting rid of stuck executors is to remove the {{latest}} symlink under {{work_dir/meta/slaves/latest/frameworks/<framework id>/executors/<executor id>/runs}}.
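
The workaround above could be scripted roughly as follows. This is a sketch under assumptions: the helper name and its parameters are hypothetical, and the issue does not say whether the agent must be stopped first. The path layout is the one quoted above.

```python
from pathlib import Path


def remove_latest_run_symlink(work_dir, framework_id, executor_id):
    """Remove the `latest` symlink under the executor's runs directory.

    Returns True if a symlink was removed, False if none was present.
    """
    latest = (Path(work_dir) / "meta" / "slaves" / "latest" / "frameworks"
              / framework_id / "executors" / executor_id / "runs" / "latest")
    if not latest.is_symlink():
        # Refuse to touch anything that is not a symlink.
        return False
    latest.unlink()  # removes the link only, not the run directory it points to
    return True
```

The {{is_symlink()}} check is deliberate: it keeps the helper from deleting a real directory if the layout differs from what is assumed here.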
> Here's how to reproduce this issue:
> # Start a task using the Docker containerizer (the same will probably happen with the command executor).
> # Stop the corresponding Mesos agent while the task is running.
> # Change the executor's checkpointed forked pid, which is located in the meta directory, e.g., {{/var/lib/mesos/slave/meta/slaves/latest/frameworks/19faf6e0-3917-48ab-8b8e-97ec4f9ed41e-0001/executors/foo.13faee90-b5f0-11e7-8032-e607d2b4348c/runs/latest/pids/forked.pid}}. I used pid 2, which is normally used by {{kthreadd}}.
> # Reboot the host
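
The pid-editing step in the list above could be done with a small helper like this one. It is a sketch only: the function name is hypothetical, and it assumes {{forked.pid}} holds the pid as plain decimal text, which the issue does not state. It should be run only while the agent is stopped, as in the steps above.

```python
from pathlib import Path


def overwrite_forked_pid(run_dir, new_pid):
    """Replace the checkpointed forked pid and return the previous value."""
    # Assumption: the pid is stored as plain decimal text in pids/forked.pid.
    pid_file = Path(run_dir) / "pids" / "forked.pid"
    old_pid = pid_file.read_text().strip()
    pid_file.write_text(str(new_pid))
    return old_pid
```

Calling it with the {{runs/latest}} directory from the example path and {{new_pid=2}} mirrors the change described in the report; the returned value lets the original pid be restored afterwards.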



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
