mesos-issues mailing list archives

From "Christopher Hunt (JIRA)" <>
Subject [jira] [Commented] (MESOS-6136) Duplicate framework id handling
Date Sun, 20 Nov 2016 04:47:58 GMT


Christopher Hunt commented on MESOS-6136:

> Why is reusing the same framework ID important?

Because our custom executors can outlive their schedulers and recover from a situation where
the schedulers are completely new to them. Our executors manage a process hierarchy that is
responsible for keeping a business up and running.

> Reusing framework IDs does not seem wise. Even after a framework has been torn down,
reusing a framework ID is not necessarily safe. Consider the following:
> ...
> Does task X belong to the "original" framework A or the new one?

Our schedulers will reconcile with their executors and decide, generally with no operator
intervention. Frameworks are in the best position to decide for their domain.
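
To make the reconciliation point concrete, here is a minimal sketch of the kind of decision
a restarted scheduler could make when it reuses a framework ID and finds executors already
running. It is an illustration only, not our actual code and not a Mesos API: the names
ReportedTask, bundle and reconcile are hypothetical, and the policy (adopt what matches the
desired state, kill the rest) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedTask:
    """A task that a surviving executor reports to a new scheduler (hypothetical type)."""
    task_id: str
    bundle: str  # hypothetical label for what the executor says it is running


def reconcile(desired, reported):
    """Decide what to do with tasks found under a reused framework ID.

    desired:  dict mapping bundle name -> number of instances the scheduler wants
    reported: iterable of ReportedTask collected from surviving executors

    Returns (adopt, kill, launch): task IDs to keep, task IDs to kill, and a
    dict of bundle name -> number of instances that still need launching.
    """
    remaining = dict(desired)  # copy so the caller's desired state is untouched
    adopt, kill = [], []
    for task in reported:
        if remaining.get(task.bundle, 0) > 0:
            # The scheduler still wants this bundle: adopt the running task.
            remaining[task.bundle] -= 1
            adopt.append(task.task_id)
        else:
            # Unknown or surplus task: the framework, not Mesos, decides to kill it.
            kill.append(task.task_id)
    launch = {bundle: count for bundle, count in remaining.items() if count > 0}
    return adopt, kill, launch
```

For example, with a desired state of two "web" instances and one "db" instance, and executors
reporting three "web" tasks and one "cache" task, the first two "web" tasks are adopted, the
third "web" task and the "cache" task are killed, and one "db" instance remains to be launched.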

> Duplicate framework id handling
> -------------------------------
>                 Key: MESOS-6136
>                 URL:
>             Project: Mesos
>          Issue Type: Improvement
>          Components: general
>    Affects Versions: 0.28.1
>         Environment: DCOS 1.7 Cloud Formation scripts
>            Reporter: Christopher Hunt
>            Priority: Critical
>              Labels: framework, lifecyclemanagement, task
> We have observed a situation where Mesos will kill tasks belonging to a framework when
that framework times out with the Mesos master for some reason, perhaps even because of a
network partition.
> While we can provide a long timeout so that, for practical purposes, Mesos will not kill a
framework's tasks, I'm wondering if there's an improvement where a framework still isn't
permitted to re-register for a given ID (as now), but Mesos also doesn't kill its tasks. What
I'm thinking is that Mesos could be "told" by an operator that this condition should be cleared.
> IMHO frameworks should be the only entities requesting that tasks be killed, unless manually
overridden by an operator.
> I'm flagging this as a critical improvement because a) the focus should be on keeping
tasks running in a system, and it isn't; and b) Mesos is working as designed. 
> In summary, I feel that by killing tasks Mesos is taking on a responsibility that it
shouldn't.

This message was sent by Atlassian JIRA
