mesos-user mailing list archives

From Damien Hardy
Subject Re: Aurora, Marathon and long lived job frameworks
Date Fri, 27 Sep 2013 16:04:25 GMT

What about Chronos?

Best regards,

2013/9/27 Dan Colish

> I have been working on an internal project for executing a large number of
> jobs across a cluster for the past couple of months, and I am currently
> doing a spike on using Mesos for some of the cluster management tasks. The
> clear prior-art winners are Aurora and Marathon, but in both cases they
> fall short of what I need.
> In Aurora's case, the software is clearly very early in the open-sourcing
> process, and as a result it is missing significant pieces. The biggest
> missing piece is the actual execution framework, Thermos. [That is what I
> assume Thermos does; I have no internal knowledge to verify that assumption.]
> Additionally, Aurora is heavily optimized for a high user count and a large
> number of incoming jobs. My use case is much simpler. There is only one
> effective user, and we have a small, known set of jobs that need to run.
> On the other hand, Marathon is not designed for job execution if a job is
> defined as a smaller unit of work. Instead, Marathon describes itself as a
> meta-framework for deploying frameworks to a Mesos cluster: to Marathon, a
> job is the framework that runs. I do not think Marathon would be a good fit
> for managing my task execution and retry logic. It is designed to run as a
> sub-layer of the cluster's resource allocation scheduler, and its
> abstractions follow suit.
> For my needs, Aurora does appear to be a much closer fit than Marathon, but
> neither is ideal. That leaves me with a rough choice. I am not thrilled
> with the prospect of yet another framework for Mesos, but a lot of the work
> I have already completed for my internal project would need to be reworked
> to fit with Aurora. Currently, my project supports the following features:
> * Distributed job locking - jobs cannot overlap
> * Job execution delay queue - jobs can be run immediately or after a delay
> * Job preemption
> * Job success/failure tracking
> * Garbage collection of dead jobs
> * Job execution failover - job is retried on a new executor
> * Executor warming - minimum number of idle executors
> * Executor limits - maximum number of executors available
> My plan for integration with Mesos is to adapt the job manager into a
> Mesos scheduler and my execution slaves into a Mesos executor. At that
> point, my framework will be able to run on the Mesos cluster, but I have a
> few concerns about how to allocate and release the resources that the
> executors will use over the lifetime of the cluster. I am not sure whether
> it is better to be greedy early in the framework's life cycle, or to
> decline resources initially and scale the framework's slaves when jobs
> start coming in. Additionally, the relationship between the executor and
> its associated driver is not immediately clear to me. If I am reading the
> code correctly, they do not provide a way to stop a task in progress short
> of killing the executor process.
> I think that Mesos will be a nice feature to add to my project, and I would
> really appreciate any feedback from the community. I will provide progress
> updates as I continue working on my experiments.
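The "executor warming" and "executor limits" features above suggest a middle ground between being greedy early and declining everything until jobs arrive. Each offer launches executors only for queued jobs plus enough to refill a minimum idle floor, never past a hard cap, and the rest is declined. The sketch below is a hypothetical policy in plain Python, not Mesos API code. PoolState, executors_to_launch, min_idle, max_executors and offered_slots are illustrative names. Turning a real offer's cpus/mem into executor-sized "slots" is assumed and left out.

```python
from dataclasses import dataclass


@dataclass
class PoolState:
    """Hypothetical snapshot of the framework's executor pool."""
    busy: int          # executors currently running a job
    idle: int          # warm executors waiting for work
    pending_jobs: int  # runnable jobs with no executor assigned yet


def executors_to_launch(state: PoolState, min_idle: int,
                        max_executors: int, offered_slots: int) -> int:
    """Return how many new executors to start from a single offer.

    offered_slots is an assumed count of executor-sized chunks the offer
    could hold; whatever is not used here would be declined so other
    frameworks can use those resources.
    """
    total = state.busy + state.idle
    # Executors needed for queued jobs that existing idle executors can't absorb.
    shortfall = max(0, state.pending_jobs - state.idle)
    # Idle executors that remain once the queued jobs are placed.
    idle_after = max(0, state.idle - state.pending_jobs)
    # Extra warm executors needed to get back up to the min_idle floor.
    warm_deficit = max(0, min_idle - idle_after)
    wanted = shortfall + warm_deficit
    # Never exceed the hard cap on executors, nor what this offer can hold.
    headroom = max(0, max_executors - total)
    return min(wanted, headroom, offered_slots)
```

With min_idle=2 on a fresh cluster, the first offer starts two warm executors. Later offers add executors only for queued jobs, up to max_executors. Once the idle floor is met and nothing is queued, the function returns 0 and the whole offer would be declined.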

Damien HARDY
