mesos-issues mailing list archives

From "Zhitao Li (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MESOS-8725) Support deadline for tasks
Date Fri, 23 Mar 2018 00:26:00 GMT

    [ https://issues.apache.org/jira/browse/MESOS-8725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16410613#comment-16410613
] 

Zhitao Li commented on MESOS-8725:
----------------------------------

[~jamesmulcahy], we actually started down that path, but ran into several scalability
difficulties:
 * limited compute resources on the scheduler: a lot of schedulers follow the same design as
the Mesos master and run only one active process, and tracking a timer per task uses up
precious resources there;
 * network partition: if the master or agent is behind a network partition, the scheduler
cannot terminate the task;
 * recovery upon scheduler restart: this was the biggest problem for us. When our scheduler
process restarted, it needed to recover "all" running tasks from the database and reconstruct
what to do for each task (which is also a common pattern among schedulers). Any additional
feature introduced there makes that process heavier;
 * cheaper to implement in the executor: with isolation mechanisms like `pid`, we expect that
the executor has a longer lifecycle. Therefore, executors do not even need to maintain a busy
thread, but can simply use a [Timer|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/include/process/timer.hpp] and
terminate the task.
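
To make the last point concrete, here is a rough sketch of the executor-side idea in Python rather than libprocess. It is only an illustration under assumptions: `run_with_deadline` and its `deadline_epoch_secs` parameter are hypothetical names, not Mesos APIs, and a real executor would also send a status update after terminating the task.

```python
import subprocess
import sys
import threading
import time


def run_with_deadline(argv, deadline_epoch_secs):
    """Run argv as a child process and terminate it if the deadline passes first.

    Hypothetical helper, not a Mesos API. Returns (return_code, deadline_fired).
    """
    child = subprocess.Popen(argv)
    fired = threading.Event()

    def on_deadline():
        # Only act if the task is still running when the timer fires.
        if child.poll() is None:
            fired.set()
            child.terminate()

    # One-shot timer: its thread sleeps until the deadline, so nothing busy-polls.
    delay = max(0.0, deadline_epoch_secs - time.time())
    timer = threading.Timer(delay, on_deadline)
    timer.daemon = True
    timer.start()

    return_code = child.wait()
    timer.cancel()  # The task ended (either way); drop any pending timer.
    return return_code, fired.is_set()
```

For example, calling `run_with_deadline([sys.executable, "-c", "import time; time.sleep(30)"], time.time() + 0.5)` should terminate the child after roughly half a second and report that the deadline fired. The scheduler is not involved at all, so a scheduler restart or a partition between scheduler and master would not affect enforcement.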

> Support deadline for tasks
> --------------------------
>
>                 Key: MESOS-8725
>                 URL: https://issues.apache.org/jira/browse/MESOS-8725
>             Project: Mesos
>          Issue Type: Improvement
>            Reporter: Zhitao Li
>            Priority: Major
>
> In our environment, we run a lot of batch jobs, some of which have tight timelines. If
any task in the job runs longer than x hours, it does not make sense to keep running it.
>  
> For instance, a team would submit a job which builds a weekly index and repeats every
Monday. If the job does not finish before the next Monday for whatever reason, there is no point
in keeping any of its tasks running.
>  
> We believe that distributing deadline tracking across our cluster makes more
sense, as it makes the system more scalable and also makes our centralized state machine simpler.
>  
> One idea I have right now is to add an *optional* *TimeInfo deadline* field to TaskInfo,
so that all default executors in Mesos can simply terminate the task and send a proper
*StatusUpdate*.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
