mesos-issues mailing list archives

From "Sargun Dhillon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MESOS-7744) Mesos Agent Sends TASK_KILL status update to Master, and still launches task
Date Thu, 20 Jul 2017 23:49:00 GMT

    [ https://issues.apache.org/jira/browse/MESOS-7744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16095569#comment-16095569 ]

Sargun Dhillon commented on MESOS-7744:
---------------------------------------

[~neilc]

The task is still running. The agent and the master both think the task is killed. The framework
receives TASK_KILLED. The framework "knows" through an out-of-band mechanism that the task is still
alive (we have our own reconciliation mechanism outside Mesos), so it resends the kill, but the
kill never reaches the executor. The executor sends TASK_RUNNING status updates to the agent,
but these never make it to the master or the framework.

It occurs if the executor is already running and the task is killed almost immediately after
it is started. Specifically, it happens when the kill arrives while the task is still on the "queue".
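
To make the ordering concrete, here is a toy model of the sequence described above. It is not Mesos code: `ToyAgent`, its methods, and the `drop_from_queue` switch are all hypothetical names invented for illustration, and the model only assumes that the agent answers a kill for a queued task itself and later flushes its queue to the executor.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAgent:
    """Hypothetical stand-in for an agent; NOT the Mesos implementation."""
    queued: list = field(default_factory=list)            # tasks waiting for the executor
    updates: list = field(default_factory=list)           # status updates sent upstream
    sent_to_executor: list = field(default_factory=list)  # tasks actually delivered

    def queue_task(self, task_id):
        self.queued.append(task_id)

    def kill_task(self, task_id, drop_from_queue):
        # Assumption: the agent answers the kill itself, because the
        # executor has not been handed the task yet.
        if task_id in self.queued:
            self.updates.append((task_id, "TASK_KILLED"))
            if drop_from_queue:
                self.queued.remove(task_id)

    def flush_queue(self):
        # Deliver everything still queued to the executor.
        while self.queued:
            self.sent_to_executor.append(self.queued.pop(0))


def run(drop_from_queue):
    agent = ToyAgent()
    agent.queue_task("task-1")
    agent.kill_task("task-1", drop_from_queue)
    agent.flush_queue()
    return agent.updates, agent.sent_to_executor
```

With `drop_from_queue=False` the model reproduces the symptom: TASK_KILLED goes upstream, yet the task is still delivered to the executor. With `drop_from_queue=True` the killed task is never delivered, which is the behavior I would expect.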

> Mesos Agent Sends TASK_KILL status update to Master, and still launches task
> ----------------------------------------------------------------------------
>
>                 Key: MESOS-7744
>                 URL: https://issues.apache.org/jira/browse/MESOS-7744
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.0.1
>            Reporter: Sargun Dhillon
>            Priority: Minor
>              Labels: reliability
>
> We sometimes launch jobs and cancel them after ~7 seconds if we don't get a TASK_STARTING back from the agent. Under certain conditions this can result in Mesos losing track of the task. The interesting chunk of the logs is here:
> {code}
> Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c mesos-slave[4290]: I0629 23:22:26.951799  5171 slave.cpp:1495] Got assigned task Titus-7590548-worker-0-4476 for framework TitusFramework
> Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c mesos-slave[4290]: I0629 23:22:26.952251  5171 slave.cpp:1614] Launching task Titus-7590548-worker-0-4476 for framework TitusFramework
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c mesos-slave[4290]: I0629 23:22:37.484611  5171 slave.cpp:1853] Queuing task 'Titus-7590548-worker-0-4476' for executor 'docker-executor' of framework TitusFramework at executor(1)@100.66.11.10:17707
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c mesos-slave[4290]: I0629 23:22:37.487876  5171 slave.cpp:2035] Asked to kill task Titus-7590548-worker-0-4476 of framework TitusFramework
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c mesos-slave[4290]: I0629 23:22:37.488994  5171 slave.cpp:3211] Handling status update TASK_KILLED (UUID: 898215d6-a244-4dbe-bc9c-878a22d36ea4) for task Titus-7590548-worker-0-4476 of framework TitusFramework from @0.0.0.0:0
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c mesos-slave[4290]: I0629 23:22:37.490603  5171 slave.cpp:2005] Sending queued task 'Titus-7590548-worker-0-4476' to executor 'docker-executor' of framework TitusFramework at executor(1)@100.66.11.10:17707
> {code}
> In our executor, we see that the launch message arrives after the master has already received the kill update. We then send non-terminal status updates to the agent, yet it doesn't forward them to our framework. We're using a custom executor based on the older mesos-go bindings.
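
The ordering in the log excerpt above (a TASK_KILLED status update handled before the queued task is sent to the executor) can be searched for mechanically. The following is a rough sketch, not a Mesos tool: the function name and regular expressions are my own, they rely only on the message wording seen in the excerpt, and they assume each log entry sits on a single line.

```python
import re

# Patterns derived from the wording in the log excerpt; other Mesos
# versions may phrase these messages differently.
KILLED = re.compile(r"Handling status update TASK_KILLED .*?for task (\S+)")
SENT = re.compile(r"Sending queued task ['‘]?([^'’\s]+)['’]?")


def launched_after_kill(lines):
    """Return task IDs sent to an executor after TASK_KILLED was handled."""
    killed = set()
    suspicious = []
    for line in lines:
        match = KILLED.search(line)
        if match:
            killed.add(match.group(1))
            continue
        match = SENT.search(line)
        if match and match.group(1) in killed:
            suspicious.append(match.group(1))
    return suspicious


# Two entries from the excerpt, with the syslog prefix trimmed.
SAMPLE = [
    "I0629 23:22:37.488994  5171 slave.cpp:3211] Handling status update "
    "TASK_KILLED (UUID: 898215d6-a244-4dbe-bc9c-878a22d36ea4) for task "
    "Titus-7590548-worker-0-4476 of framework TitusFramework from @0.0.0.0:0",
    "I0629 23:22:37.490603  5171 slave.cpp:2005] Sending queued task "
    "'Titus-7590548-worker-0-4476' to executor 'docker-executor' of "
    "framework TitusFramework at executor(1)@100.66.11.10:17707",
]
```

Calling `launched_after_kill(SAMPLE)` flags `Titus-7590548-worker-0-4476`. Running the same check with the two entries in the opposite order flags nothing, since a kill that follows delivery is the normal case.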



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
