Subject: Re: What happens if a scheduler registers with a framework ID that hasn't been used in 48 hours?
From: Sharma Podila
To: user@mesos.apache.org
Date: Mon, 21 Apr 2014 15:10:05 -0700

On a related note, what if the framework scheduler is up while the Mesos master goes down? Then, if the Mesos master restarts after a time interval greater than the framework failover timeout, what is the expected behavior? Would the framework successfully get a re-registered() callback? An error() callback? Something else?

On Fri, Apr 18, 2014 at 10:54 AM, Vinod Kone wrote:

> I think you are on the right track here.
>
> I would recommend setting a high failover timeout that is an upper bound
> for all of your schedulers being down (e.g., 1 week). This way, even if all
> your scheduler instances are down due to outage/maintenance, your
> tasks/services keep running in the Mesos cluster.
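Vinod's advice above amounts to setting the `failover_timeout` field (a duration in seconds) on the `FrameworkInfo` passed at registration. A minimal sketch of that choice, with `FrameworkInfo` modeled as a plain dict for illustration (the `make_framework_info` helper and the framework name are hypothetical; in a real scheduler this would be the Mesos `FrameworkInfo` protobuf):

```python
# Sketch: pick a failover_timeout that upper-bounds expected scheduler downtime,
# so Mesos keeps the framework's tasks running while no scheduler is connected.

ONE_WEEK_SECONDS = 7 * 24 * 60 * 60  # 604800, Vinod's "e.g., 1 week" suggestion

def make_framework_info(name, user, failover_timeout=ONE_WEEK_SECONDS):
    """Build registration info with a long failover timeout so tasks
    survive scheduler outages and maintenance windows."""
    return {
        "name": name,
        "user": user,
        # Seconds the master waits for a scheduler failover before
        # tearing down the framework's tasks.
        "failover_timeout": float(failover_timeout),
    }

info = make_framework_info("my-ha-framework", "mesos")
```

The key trade-off: a timeout that is too short tears down healthy tasks during a long outage, while a very long one delays cleanup of genuinely dead frameworks.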
>
> On Fri, Apr 18, 2014 at 5:02 AM, David Greenberg wrote:
>
>> Hey Vinod,
>> The problem I'm trying to solve is writing a framework that can run on
>> our HA application cluster, so that whenever the framework's current
>> scheduler dies, another node will be elected and take over. I'm trying
>> to work through the various failure cases to understand how to implement
>> this so that it works through all the failure cases I can think of.
>>
>> It sounds like the solution that'd work best for me would be to try to
>> read the framework ID from a known location and register with that. If
>> it's not there, or if registration fails, then the framework should
>> register anew.
>>
>> This framework's state is very large and resides in a couple of
>> databases, so even if the entire set of candidates for becoming the
>> framework is down for the whole failover grace period, the framework
>> still wants to register, since its state never gets invalidated.
>>
>> Thanks,
>> David
>>
>> On Thursday, April 17, 2014, Vinod Kone wrote:
>>
>>> On Thu, Apr 17, 2014 at 2:56 PM, David Greenberg wrote:
>>>
>>>> My follow-up question is this: is there a way to tell whether I'm
>>>> outside of the timeout window? I'd like to have my framework check ZK
>>>> and determine whether it's within the framework timeout or not, so
>>>> that it can make the correct call.
>>>
>>> Hey David,
>>>
>>> Currently, the only signal you can get is by hitting the "/state.json"
>>> endpoint on the master. The framework should have been moved to
>>> 'completed_frameworks' after the failover timeout. Of course, if a
>>> master fails over, this information is lost, so you can't reliably
>>> depend on it.
>>>
>>> When the master starts storing persistent state about frameworks
>>> (likely a couple of releases away), a re-registration attempt in such
>>> a case would be denied by the master. So that could be your signal.
>>> Alternatively, with persistence, you could also more reliably depend
>>> on "/state.json" to get this info.
>>>
>>> To take a step back, what is the problem you are trying to solve?
>>>
>>> Thanks,
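Combining Vinod's "/state.json" signal with David's read-the-stored-ID approach, the re-registration decision might look like the sketch below. The `frameworks` / `completed_frameworks` lists with per-framework `id` fields follow the shape of the master's state.json; the `choose_framework_id` helper and sample IDs are hypothetical, and the thread's caveat applies: after a master failover, `completed_frameworks` is lost, so this signal is best-effort only.

```python
# Sketch: decide whether to re-register with a stored framework ID or
# register anew, using a parsed /state.json response from the master.

def choose_framework_id(stored_id, master_state):
    """Return stored_id if it looks safe to re-register with it, else None
    (meaning: register as a new framework and persist the new ID)."""
    if stored_id is None:
        # No ID stored in the known location (e.g., ZK): register anew.
        return None
    completed = {f["id"] for f in master_state.get("completed_frameworks", [])}
    if stored_id in completed:
        # Failover timeout already expired; the master considers this
        # framework completed, so reusing the ID won't work.
        return None
    return stored_id

# Example state.json fragment (shape only; values are illustrative).
state = {
    "frameworks": [{"id": "fw-123", "name": "my-ha-framework"}],
    "completed_frameworks": [{"id": "fw-old", "name": "retired"}],
}
```

With this shape, `choose_framework_id("fw-123", state)` returns the stored ID (still active), while `choose_framework_id("fw-old", state)` returns `None`, signaling a fresh registration.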