Mailing-List: contact user-help@mesos.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@mesos.apache.org
Received-SPF: pass (athena.apache.org: domain of tahasam@gmail.com designates
 209.85.220.50 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CADVt52wez=XU_X83fFmDQQzNaf3koaFKmdciNBfmS75F+S_Vwg@mail.gmail.com>
References: 
 <CADVt52woSWmvvQp5i_i9_G1+-rRvTsO+JjU3PaZQr1OC4mY0bg@mail.gmail.com>
 <CADK=YxsXHxXjK9T64dn20VK4SA1bXMcLSvpVUvysEvj9ojeM4A@mail.gmail.com>
 <CADVt52wez=XU_X83fFmDQQzNaf3koaFKmdciNBfmS75F+S_Vwg@mail.gmail.com>
From: Sam Taha <tahasam@gmail.com>
Date: Fri, 27 Sep 2013 15:40:26 -0400
Message-ID: 
 <CAOC0Rpn5wGKhpiWzy-MQ0rdAX7sLJnL4pejnA01NbcpMS_k3ew@mail.gmail.com>
Subject: Re: Aurora, Marathon and long lived job frameworks
To: user@mesos.apache.org
Content-Type: multipart/alternative; boundary=e89a8ffbad61b333eb04e762a94f

--e89a8ffbad61b333eb04e762a94f
Content-Type: text/plain; charset=ISO-8859-1

While still in active development, I expect JobServer to match some of the
criteria you describe once its Mesos integration is complete. It currently
supports these features on static node clusters. With Mesos integration,
it will gain dynamic clustering while retaining its enterprise-style job
scheduling, monitoring, and tracking features.

Thanks,
Sam Taha

http://www.grandlogic.com


On Fri, Sep 27, 2013 at 12:59 PM, Dan Colish <dcolish@urbanairship.com> wrote:

>
> On Fri, Sep 27, 2013 at 9:04 AM, Damien Hardy <dhardy@viadeoteam.com> wrote:
>
>> Hello,
>>
>> What about Chronos? http://airbnb.github.io/chronos/
>>
>>
> Yes, I evaluated Chronos and it was not clear to me how it matches my
> selection criteria. It might be my unfamiliarity with the framework rather
> than a lack of features. Is there anyone who could elaborate more?
>
>
>> Best regards,
>>
>>
>> 2013/9/27 Dan Colish <dcolish@urbanairship.com>
>>
>>> I have been working on an internal project for executing a large number
>>> of jobs across a cluster for the past couple of months and I am currently
>>> doing a spike on using Mesos for some of the cluster management tasks. The
>>> clear prior art winners are Aurora and Marathon, but in both cases they
>>> fall short of what I need.
>>>
>>> In Aurora's case, the software is clearly very early in the open
>>> sourcing process and as a result it is missing significant pieces. The
>>> biggest missing piece is the actual execution framework, Thermos. [That is
>>> what I assume Thermos does. I have no internal knowledge to verify that
>>> assumption.] Additionally, Aurora is heavily optimized for a high user count
>>> and large number of incoming jobs. My use case is much simpler. There is
>>> only one effective user and we have a small known set of jobs which need to
>>> run.
>>>
>>> On the other hand, Marathon is not designed for job execution if a job is
>>> defined to be a smaller unit of work. Instead, Marathon self-describes as a
>>> meta-framework for deploying frameworks to a Mesos cluster. To Marathon, a
>>> job is the framework that runs. I do not think Marathon would be a
>>> good fit for managing my task execution and retry logic. It is designed
>>> to run as a sub-layer of the cluster's resource allocation scheduler
>>> and its abstractions follow suit.
>>>
>>> For my needs Aurora does appear to be a much closer fit than Marathon,
>>> but neither is ideal. Since that is the case, I find myself left with a
>>> rough choice. I am not thrilled with the prospect of yet another framework
>>> for Mesos, but there is a lot of work which I have already completed for my
>>> internal project that would need to be reworked to fit with Aurora. Currently
>>> my project supports the following features:
>>>
>>> * Distributed job locking - jobs cannot overlap
>>> * Job execution delay queue - jobs can be run immediately or after a
>>> delay
>>> * Job preemption
>>> * Job success/failure tracking
>>> * Garbage collection of dead jobs
>>> * Job execution failover - job is retried on a new executor
>>> * Executor warming - min # of executors idle
>>> * Executor limits - max # of executors available
>>>
>>> My plan for integration with Mesos is to adapt the job manager into a
>>> Mesos scheduler and my execution slaves into a Mesos executor. At that
>>> point, my framework will be able to run on the Mesos cluster, but I have a
>>> few concerns about how to allocate and release the resources that the
>>> executors will use over the lifetime of the cluster. I am not sure whether
>>> it is better to be greedy early on in the framework's life cycle or to
>>> decline resources initially and scale the framework's slaves when jobs
>>> start coming in. Additionally, the relationship between the executor and
>>> its associated driver is not immediately clear to me. If I am reading the
>>> code correctly, they do not provide a way to stop a task in progress short
>>> of killing the executor process.
>>>
>>> I think that Mesos will be a nice feature to add to my project and I
>>> would really appreciate any feedback from the community. I will provide
>>> progress updates as I continue work on my experiments.
>>>
>>
>>
>>
>> --
>> Damien HARDY
>>
>
>
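
On the greedy-versus-lazy question in Dan's original message, here is a
minimal Python sketch of one possible policy. It assumes nothing about the
Mesos API: OfferPolicy, min_idle, max_executors, and executors_to_launch
are hypothetical names chosen for illustration, mapping loosely onto the
"executor warming" and "executor limits" features in the list above. It is
not how Aurora, Marathon, or JobServer behave.

```python
from dataclasses import dataclass


@dataclass
class OfferPolicy:
    """Hypothetical policy deciding how many executors to start per offer round."""

    min_idle: int       # "executor warming": idle executors to keep ready
    max_executors: int  # "executor limits": hard cap on total executors

    def executors_to_launch(self, running: int, idle: int, queued_jobs: int) -> int:
        """Return how many new executors to start from the current offers.

        running     -- executors currently alive (busy and idle together)
        idle        -- executors alive but not running a job
        queued_jobs -- jobs waiting for an executor
        """
        # Enough idle executors to absorb the queue and keep the warm pool full.
        wanted_idle = queued_jobs + self.min_idle
        shortfall = max(0, wanted_idle - idle)
        # Never exceed the configured cap.
        headroom = max(0, self.max_executors - running)
        return min(shortfall, headroom)


policy = OfferPolicy(min_idle=2, max_executors=10)
cold_start = policy.executors_to_launch(running=0, idle=0, queued_jobs=0)
steady = policy.executors_to_launch(running=2, idle=2, queued_jobs=0)
burst = policy.executors_to_launch(running=2, idle=2, queued_jobs=5)
near_cap = policy.executors_to_launch(running=9, idle=0, queued_jobs=5)
```

Setting min_idle equal to max_executors gives the greedy behaviour, and
setting it to 0 gives the fully lazy one. Any offers beyond the returned
count would be declined.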

--e89a8ffbad61b333eb04e762a94f--