mesos-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Vinod Kone (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MESOS-1817) Completed tasks remains in TASK_RUNNING when framework is disconnected
Date Thu, 09 Oct 2014 18:18:33 GMT

    [ https://issues.apache.org/jira/browse/MESOS-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14165465#comment-14165465
] 

Vinod Kone commented on MESOS-1817:
-----------------------------------

This will be fixed as part of MESOS-1799.

> Completed tasks remains in TASK_RUNNING when framework is disconnected
> ----------------------------------------------------------------------
>
>                 Key: MESOS-1817
>                 URL: https://issues.apache.org/jira/browse/MESOS-1817
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Niklas Quarfot Nielsen
>
> We have run into a problem that cause tasks which completes, when a framework is disconnected
and has a fail-over time, to remain in a running state even though the tasks actually finishes.
This hogs the cluster and gives users a inconsistent view of the cluster state. Going to the
slave, the task is finished. Going to the master, the task is still in a non-terminal state.
When the scheduler reattaches or the failover timeout expires, the tasks finishes correctly.
The current workflow of this scheduler has a long fail-over timeout, but may on the other
hand never reattach.
> Here is a test framework we have been able to reproduce the issue with: https://gist.github.com/nqn/9b9b1de9123a6e836f54
> It launches many short-lived tasks (1 second sleep) and when killing the framework instance,
the master reports the tasks as running even after several minutes: http://cl.ly/image/2R3719461e0t/Screen%20Shot%202014-09-10%20at%203.19.39%20PM.png
> When clicking on one of the slaves where, for example, task 49 runs; the slave knows
that it completed: http://cl.ly/image/2P410L3m1O1N/Screen%20Shot%202014-09-10%20at%203.21.29%20PM.png
> Here is the log of a mesos-local instance where I reproduced it: https://gist.github.com/nqn/f7ee20601199d70787c0
(Here task 10 to 19 are stuck in running state).
> There is a lot of output, so here is a filtered log for task 10: https://gist.github.com/nqn/a53e5ea05c5e41cd5a7d
> The problem turn out to be an issue with the ack-cycle of status updates:
> If the framework disconnects (with a failover timeout set), the status update manage
on the slaves will keep trying to send the front of status update stream to the master (which
in turn forwards it to the framework). If the first status update after the disconnect is
terminal, things work out fine; the master pick the terminal state up, removes the task and
release the resources.
> If, on the other hand, one non-terminal status is in the stream. The master will never
know that the task finished (or failed) before the framework reconnects.
> During a discussion on the dev mailing list (http://mail-archives.apache.org/mod_mbox/mesos-dev/201409.mbox/%3cCADKthhAVR5mrq1s9HXw1BB_XFALXWWxjutp7MV4y3wP-Bh=aWg@mail.gmail.com%3e)
we enumerated a couple of options to solve this problem.
> First off, having two ack-cycles: one between masters and slaves and one between masters
and frameworks, would be ideal. We would be able to replay the statuses in order while keeping
the master state current. However, this requires us to persist the master state in a replicated
storage.
> As a first pass, we can make sure that the tasks caught in a running state doesn't hog
the cluster when completed and the framework being disconnected.
> Here is a proof-of-concept to work out of: https://github.com/nqn/mesos/tree/niklas/status-update-disconnect/
> A new (optional) field have been added to the internal status update message:
> https://github.com/nqn/mesos/blob/niklas/status-update-disconnect/src/messages/messages.proto#L68
> Which makes it possible for the status update manager to set the field, if the latest
status was terminal: https://github.com/nqn/mesos/blob/niklas/status-update-disconnect/src/slave/status_update_manager.cpp#L501
> I added a test which should high-light the issue as well:
> https://github.com/nqn/mesos/blob/niklas/status-update-disconnect/src/tests/fault_tolerance_tests.cpp#L2478
> I would love some input on the approach before moving on.
> There are rough edges in the PoC which (of course) should be addressed before bringing
it for up review.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message