mesos-dev mailing list archives

From "Benjamin Mahler (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (MESOS-525) Slave should kill tasks when disconnected from the master for longer than the health check timeout.
Date Tue, 16 Jul 2013 00:16:49 GMT

     [ https://issues.apache.org/jira/browse/MESOS-525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler resolved MESOS-525.
-----------------------------------

    Resolution: Won't Fix
    
> Slave should kill tasks when disconnected from the master for longer than the health check timeout.
> ---------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-525
>                 URL: https://issues.apache.org/jira/browse/MESOS-525
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benjamin Mahler
>            Assignee: Benjamin Mahler
>
> The following scenario was observed in production at Twitter:
> 1. Task T begins running on a slave:
> I0618 02:54:38.069694 15362 slave.cpp:830] Status update: task T of framework F is now in state TASK_RUNNING
> 2. Due to a network partition, the slave is removed from the master for failing health checks:
> W0618 23:56:18.063217 28745 master.cpp:1172] Removing slave 201304011727-2230002186-5050-28738-3217 at S:5051 because it has been deactivated
> I0618 23:56:18.068821 28745 master.cpp:1181] Master now considering a slave at S:5051 as inactive
> 3. The task stayed running on the partitioned slave for 6 days, until a user manually killed the process and the executor marked it as finished:
> I0624 20:20:57.565053 15380 slave.cpp:830] Status update: task 1371524058397-ads-adshard-production-153-a4504eb0-384b-4600-b6fe-e080c87bd84e of framework 201104070004-0000002563-0000 is now in state TASK_FINISHED
> There are a few ways to fix this in the slave. These rely on the fact that the master will have marked the tasks as LOST when it removed the slave, after which point we don't want the tasks to continue running.
>   1. Have the slave commit suicide after (<health_check_failure_timeout> + buffer) amount of time of disconnection from the master. This only works well when cgroups is in use, to ensure the next run of the slave cleans up properly, and it gets messier with slave recovery.
>   2. A cleaner approach would be to have the slave kill all executors running under it. We most likely want to send TASK_LOST updates for the tasks, although this will mean duplicate updates unless the master handles these correctly. Alternatively, we can avoid sending any updates, but we'll need to guarantee that the updates were sent by the master.
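
The second option can be illustrated with a minimal sketch. This is not Mesos code and does not reflect the slave's actual implementation. The class name DisconnectionWatchdog, its fields and methods, and the timeout values are all hypothetical, chosen only to show the timing rule: once the slave has been disconnected for at least (health check timeout + buffer), it gives up its executors and produces TASK_LOST updates for their tasks.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class DisconnectionWatchdog:
    """Hypothetical helper, not part of the Mesos code base."""

    health_check_timeout: float  # assumed: seconds before the master removes the slave
    buffer: float                # assumed: extra safety margin in seconds
    # Maps an executor id to the ids of the tasks it is running.
    tasks_by_executor: Dict[str, List[str]] = field(default_factory=dict)
    disconnected_since: Optional[float] = None

    def on_master_contact(self) -> None:
        # Any successful contact with the master resets the clock.
        self.disconnected_since = None

    def on_disconnect(self, now: float) -> None:
        # Record only the first moment of disconnection.
        if self.disconnected_since is None:
            self.disconnected_since = now

    def check(self, now: float) -> List[Tuple[str, str]]:
        """Return (task_id, "TASK_LOST") pairs once the deadline has passed.

        Returns an empty list while connected or still within the deadline.
        Killing the executor processes themselves is left out of this sketch.
        """
        if self.disconnected_since is None:
            return []
        deadline = self.disconnected_since + self.health_check_timeout + self.buffer
        if now < deadline:
            return []
        updates = [
            (task_id, "TASK_LOST")
            for task_ids in self.tasks_by_executor.values()
            for task_id in task_ids
        ]
        # Forget the executors so a later check does not repeat the updates.
        self.tasks_by_executor.clear()
        return updates
```

With a 60 second timeout and a 10 second buffer (illustrative numbers only), a slave that loses the master at t=100 would keep its tasks until t=170 and give them up at that point. Clearing the task map after the first expiry keeps the slave itself from emitting the same update twice, but it does not address duplicates relative to the LOST updates the master has already generated, which is the concern raised above.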

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
