mesos-dev mailing list archives

From James Peach <jpe...@apache.org>
Subject Re: API review: max_duration on TaskInfo
Date Wed, 28 Mar 2018 17:15:45 GMT


> On Mar 23, 2018, at 2:21 PM, Zhitao Li <zhitaoli.cs@gmail.com> wrote:
> 
> Hi everyone,
> 
> I'd like to do an API review for MESOS-8725
> <https://issues.apache.org/jira/browse/MESOS-8725>. We are adding an
> optional `max_duration` to `TaskInfo` field. If a task does not terminate
> within this duration, built-in executors will kill the task with a new
> reason `REASON_MAX_DURATION_REACHED`.
> 
> Proof of concept patch:
> https://reviews.apache.org/r/66258/
> 
> Reference implementation in command executor:
> https://reviews.apache.org/r/66259/
> 
> A design choice we made is to make this relative duration rather than an
> absolute timestamp of deadline. Our rationales:
> 
>   - Cluster could suffer from clock skews, so same absolute deadline would
>   result in inconsistent behavior;
>   - Framework can just trivially translate its own clock as source of
>   truth to translate absolute deadline to current time + max_duration.
> 
> Please let me know what you think. Thanks.

Bringing our conversation about task group semantics back to the list.

The current reviews require all tasks in a group to have the same max_duration. This is equivalent
to specifying max_duration on the task group itself. This means that when the time is up,
the whole group gets torn down. Validation on the master ensures that schedulers have to set
the same value across all the tasks.
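
As a rough illustration of that validation rule (a sketch only, not the code in the reviews;
the `TaskInfo` class, the `max_duration_secs` field and `validate_task_group` are hypothetical
stand-ins for the real protobuf message and master validation), the check amounts to:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TaskInfo:
    # Hypothetical stand-in for the Mesos TaskInfo message.
    name: str
    max_duration_secs: Optional[float] = None  # None means "no limit set"


def validate_task_group(tasks: List[TaskInfo]) -> Optional[str]:
    """Return an error message if the tasks disagree on max_duration, else None."""
    durations = {task.max_duration_secs for task in tasks}
    if len(durations) > 1:
        return "All tasks in a task group must have the same max_duration"
    return None
```

Under this sketch, a group where only some tasks set a duration is rejected as well, since
"unset" counts as a distinct value.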

Alternatively, we could allow the duration to be different for tasks and then just kill the
individual task when its time expires. In this case, the task will have a final status of
TASK_KILLED, which will cause the Mesos default executor to tear down the rest of the group.
So we have the same effect, though it is expressed differently in the API.
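
To make the per-task alternative concrete, here is a toy model (not executor code). It assumes
every task in the group starts at time 0, no task exits on its own, and the executor tears the
group down as soon as any task terminates; `simulate_group` is a made-up name for illustration.

```python
from typing import Dict, Optional, Tuple


def simulate_group(
    max_durations: Dict[str, Optional[float]]
) -> Tuple[Optional[str], Optional[float]]:
    """Return (first task to hit its max_duration, time the group is torn down).

    A value of None means the task has no max_duration. If no task has a
    limit, nothing expires and (None, None) is returned.
    """
    limited = {name: d for name, d in max_durations.items() if d is not None}
    if not limited:
        return None, None
    # The task with the shortest duration expires first; its termination
    # takes the rest of the group down with it.
    first = min(limited, key=lambda name: limited[name])
    return first, limited[first]
```

For example, `simulate_group({"a": 30.0, "b": 10.0, "c": None})` reports that task "b" expired
and the group went away at 10 seconds, so the group lifetime is the minimum of the per-task
durations while the scheduler can still see which task hit its limit.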

So maybe the cleanest way to express this for task groups is to place the max_duration in
the `TaskGroupInfo`? However, if we do that, then we lose any information about which task
exceeded the duration (since by definition they all did). So I'm leaning towards allowing
a per-task max_duration.

We should also define what this API means for the final `TaskStatus` of the task. In my executor,
the rule we follow is that `TASK_KILLED` is only ever used in response to explicit KILL requests
from the scheduler. If the max_duration is exceeded, I think that we should classify that
as `TASK_FAILED`.
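
A sketch of the classification I have in mind, pairing the state with the reason from the
original proposal. The `Cause` enum and `terminal_status` helper are hypothetical names for
illustration, and the states and reasons are plain strings here rather than the real protobuf
enums:

```python
from enum import Enum
from typing import Optional, Tuple


class Cause(Enum):
    # Hypothetical labels for why the executor is stopping a task.
    SCHEDULER_KILL = "scheduler_kill"
    MAX_DURATION_REACHED = "max_duration_reached"


def terminal_status(cause: Cause) -> Tuple[str, Optional[str]]:
    """Map the cause of termination to a (task state, reason) pair."""
    if cause is Cause.SCHEDULER_KILL:
        # TASK_KILLED is reserved for explicit KILL requests from the scheduler.
        return "TASK_KILLED", None
    # Exceeding max_duration is treated as a failure, tagged with the new reason.
    return "TASK_FAILED", "REASON_MAX_DURATION_REACHED"
```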

thanks,
James