mesos-issues mailing list archives

From "Mao Geng (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MESOS-5482) mesos/marathon task stuck in staging after slave reboot
Date Mon, 11 Sep 2017 23:34:02 GMT

    [ https://issues.apache.org/jira/browse/MESOS-5482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16162232#comment-16162232
] 

Mao Geng commented on MESOS-5482:
---------------------------------

[~chhsia0] The problem occurred when the agent lost its connection to the master and
re-registered; no one was actually shutting down Marathon. MESOS-7215 looks like the root cause.
When the agent re-registered, it shut down all executors of non-partition-aware frameworks,
including the Marathon task. Meanwhile, Marathon tried to launch a new task on the agent, and
the agent ignored the launch because it thought the framework was shutting down, so the
task got stuck in the "staging" state. Marathon then tried to kill the task because it was
overdue on deployment, and the agent ignored that request too.
Restarting the agent resolves the issue, though.
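
The sequence above can be sketched as a toy state model. This is only an illustration of the behaviour described in this comment, not Mesos code: the class, method names, and log strings are all hypothetical, and the model assumes the agent simply drops launch and kill requests for a framework it has marked as terminating.

```python
from enum import Enum


class FrameworkState(Enum):
    RUNNING = "running"
    TERMINATING = "terminating"


class ToyAgent:
    """Hypothetical stand-in for the agent; not the real Mesos agent."""

    def __init__(self):
        self.frameworks = {}  # framework id -> {"state", "partition_aware"}
        self.log = []

    def add_framework(self, framework_id, partition_aware):
        self.frameworks[framework_id] = {
            "state": FrameworkState.RUNNING,
            "partition_aware": partition_aware,
        }

    def reregister(self):
        # On re-registration, executors of frameworks that are not
        # partition-aware are shut down (as described in the comment).
        for framework_id, info in self.frameworks.items():
            if not info["partition_aware"]:
                info["state"] = FrameworkState.TERMINATING
                self.log.append(f"shutting down executors of {framework_id}")

    def _is_terminating(self, framework_id):
        return self.frameworks[framework_id]["state"] is FrameworkState.TERMINATING

    def launch_task(self, framework_id, task_id):
        # Assumption: requests for a terminating framework are dropped.
        if self._is_terminating(framework_id):
            self.log.append(f"ignoring launch of {task_id}")
            return False
        self.log.append(f"launching {task_id}")
        return True

    def kill_task(self, framework_id, task_id):
        if self._is_terminating(framework_id):
            self.log.append(f"ignoring kill of {task_id}")
            return False
        self.log.append(f"killing {task_id}")
        return True


def simulate():
    agent = ToyAgent()
    agent.add_framework("marathon", partition_aware=False)
    agent.reregister()

    # The scheduler's view of the task: it only changes if the agent acts.
    task_state = "TASK_STAGING"
    if agent.launch_task("marathon", "new-task"):
        task_state = "TASK_RUNNING"
    if agent.kill_task("marathon", "new-task"):
        task_state = "TASK_KILLED"
    return task_state, agent.log


if __name__ == "__main__":
    state, log = simulate()
    print(state)
    for entry in log:
        print(entry)
```

In this model the task never leaves staging because neither request reaches an executor, which matches the symptom reported here; constructing a fresh ToyAgent plays the role of the agent restart.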

> mesos/marathon task stuck in staging after slave reboot
> -------------------------------------------------------
>
>                 Key: MESOS-5482
>                 URL: https://issues.apache.org/jira/browse/MESOS-5482
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: lutful karim
>              Labels: tech-debt
>         Attachments: marathon-mesos-masters_after-reboot.log, mesos-masters_mesos.log, mesos_slaves_after_reboot.log, tasks_running_before_rebooot.marathon
>
>
> The main idea of mesos/marathon is to let you sleep well, but after a node reboot the mesos task gets stuck in staging for about 4 hours.
> To reproduce the issue:
> - Set up a mesos cluster in HA mode with systemd-enabled mesos-master and mesos-slave services.
> - Run a docker registry (https://hub.docker.com/_/registry/) with the mesos constraint (hostname:LIKE:mesos-slave-1) on one node. Reboot the node and notice that the task gets stuck in staging.
> Possible workaround: running "service mesos-slave restart" fixes the issue.
> OS: centos 7.2
> mesos version: 0.28.1
> marathon: 1.1.1
> zookeeper: 3.4.8
> docker: 1.9.1 dockerAPIversion: 1.21
> error message:
> May 30 08:38:24 euca-10-254-237-140 mesos-slave[832]: W0530 08:38:24.120013   909 slave.cpp:2018] Ignoring kill task docker-registry.066fb448-2628-11e6-bedd-d00d0ef81dc3 because the executor 'docker-registry.066fb448-2628-11e6-bedd-d00d0ef81dc3' of framework 8517fcb7-f2d0-47ad-ae02-837570bef929-0000 is terminating/terminated
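
To check whether an agent is hitting this condition, the warning quoted above can be matched in the agent log. The following is a rough sketch: the regular expression is derived only from the single log line in this report, the function name is made up for illustration, and other Mesos versions may word the message differently.

```python
import re

# Pattern derived from the one warning line quoted in this issue.
IGNORED_KILL = re.compile(
    r"Ignoring kill task (?P<task>\S+) because the executor "
    r"'(?P<executor>[^']+)' of framework (?P<framework>\S+) "
    r"is terminating/terminated"
)


def parse_ignored_kill(line):
    """Return task, executor, and framework IDs, or None if no match."""
    match = IGNORED_KILL.search(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = (
        "May 30 08:38:24 euca-10-254-237-140 mesos-slave[832]: "
        "W0530 08:38:24.120013   909 slave.cpp:2018] "
        "Ignoring kill task docker-registry.066fb448-2628-11e6-bedd-d00d0ef81dc3 "
        "because the executor "
        "'docker-registry.066fb448-2628-11e6-bedd-d00d0ef81dc3' "
        "of framework 8517fcb7-f2d0-47ad-ae02-837570bef929-0000 "
        "is terminating/terminated"
    )
    print(parse_ignored_kill(sample))
```

Repeated matches for the same framework ID after an agent re-registration would be consistent with the stuck-in-staging behaviour described in this issue.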



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
