mesos-issues mailing list archives

From "Gilbert Song (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (MESOS-7911) Non-checkpointing framework's tasks should not be marked LOST when agent disconnects.
Date Sat, 20 Jan 2018 00:32:00 GMT

     [ https://issues.apache.org/jira/browse/MESOS-7911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gilbert Song reassigned MESOS-7911:
-----------------------------------

    Assignee: Gilbert Song

> Non-checkpointing framework's tasks should not be marked LOST when agent disconnects.
> -------------------------------------------------------------------------------------
>
>                 Key: MESOS-7911
>                 URL: https://issues.apache.org/jira/browse/MESOS-7911
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benjamin Mahler
>            Assignee: Gilbert Song
>            Priority: Critical
>              Labels: reliability
>
> Currently, when a framework with checkpointing disabled has tasks running on an agent and that agent disconnects from the master, the master marks those tasks LOST and removes them from its memory. The assumption is that the agent disconnected because it terminated.
> However, the disconnection may be due to a transient loss of connectivity, in which case the agent re-connects without ever having terminated. This case violates our assumption that the agent has no tasks unknown to the master:
> ```
> void Master::reconcileKnownSlave(
>     Slave* slave,
>     const vector<ExecutorInfo>& executors,
>     const vector<Task>& tasks)
> {
>   ...
>   // TODO(bmahler): There's an implicit assumption here the slave
>   // cannot have tasks unknown to the master. This _should_ be the
>   // case since the causal relationship is:
>   //   slave removes task -> master removes task
>   // Add error logging for any violations of this assumption!
> ```
> As a result, the tasks would remain on the agent but the master would not know about them!
> A more appropriate action here would be:
> (1) When an agent disconnects, mark the tasks as unreachable.
>   (a) If the framework is not partition aware, only show it the last known task state.
>   (b) If the framework is partition aware, let it know that the tasks are now unreachable.
> (2) If the agent re-connects:
>   (a) And the agent had restarted, let the non-checkpointing framework know its tasks are GONE/LOST.
>   (b) If the agent still holds the tasks, the tasks are restored as reachable.
> (3) If the agent gets removed:
>   (a) For partition aware non-checkpointing frameworks, let them know the tasks are unreachable.
>   (b) For non partition aware non-checkpointing frameworks, let them know the tasks are lost and kill them if the agent comes back.
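
The proposed handling in (1) to (3) can be read as a small decision table. The following is a minimal sketch of that table, not Mesos code: `AgentEvent`, `MasterView`, `Outcome` and `onAgentEvent` are hypothetical names invented for illustration. Two points are assumptions rather than statements from the issue: "GONE/LOST" in (2a) is read as TASK_GONE for partition-aware frameworks and TASK_LOST otherwise, and (2b) sends no update because the issue does not say what, if anything, is reported on restore.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified model of the proposal; these types are not Mesos APIs.
enum class AgentEvent {
  Disconnected,             // case (1)
  ReconnectedAfterRestart,  // case (2a)
  ReconnectedWithTasks,     // case (2b)
  Removed                   // case (3)
};

// How the master tracks the task after the event.
enum class MasterView { Reachable, Unreachable, Terminal };

struct Outcome {
  MasterView view;
  std::string update;       // state reported to the framework; empty = no update
  bool killIfAgentReturns;  // only set in case (3b)
};

// Outcome for one task of a non-checkpointing framework.
Outcome onAgentEvent(AgentEvent event, bool partitionAware)
{
  switch (event) {
    case AgentEvent::Disconnected:
      // (1) Keep the task, tracked as unreachable. Only partition-aware
      // frameworks are told; others keep seeing the last known state.
      return Outcome{MasterView::Unreachable,
                     partitionAware ? "TASK_UNREACHABLE" : "",
                     false};

    case AgentEvent::ReconnectedAfterRestart:
      // (2a) The agent restarted, so the tasks no longer exist.
      // Assumption: GONE for partition-aware frameworks, LOST otherwise.
      return Outcome{MasterView::Terminal,
                     partitionAware ? "TASK_GONE" : "TASK_LOST",
                     false};

    case AgentEvent::ReconnectedWithTasks:
      // (2b) The agent still holds the tasks: restore them as reachable.
      // Assumption: no update is modelled here.
      return Outcome{MasterView::Reachable, "", false};

    case AgentEvent::Removed:
      // (3a) Partition-aware: report the tasks as unreachable.
      // (3b) Otherwise: report them lost and kill them if the agent returns.
      if (partitionAware) {
        return Outcome{MasterView::Unreachable, "TASK_UNREACHABLE", false};
      }
      return Outcome{MasterView::Terminal, "TASK_LOST", true};
  }

  assert(false && "unhandled AgentEvent");
  return Outcome{MasterView::Terminal, "", false};
}
```

For example, `onAgentEvent(AgentEvent::Disconnected, false)` keeps the task in the master as unreachable and sends nothing to the framework, which is the key difference from the current behaviour of marking it LOST and forgetting it.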



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
