Subject: Re: What happens if a scheduler registers with a framework ID that hasn't been used in 48 hours?
From: Sharma Podila
To: user@mesos.apache.org
Date: Mon, 21 Apr 2014 15:10:05 -0700

On a related note, what if the framework scheduler is up while the Mesos master goes down? Then, if the Mesos master restarts after a time interval greater than the framework failover timeout, what is the expected behavior? Would the framework successfully get a re-registered() callback? An error() callback? Something else?

On Fri, Apr 18, 2014 at 10:54 AM, Vinod Kone wrote:

> I think you are on the right track here.
>
> I would recommend setting a high failover timeout that is an upper bound
> for all of your schedulers being down (e.g., 1 week). This way, even if all
> your scheduler instances are down due to outage/maintenance, your
> tasks/services keep running in the Mesos cluster.
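Vinod's advice above amounts to setting the `failover_timeout` field (a duration in seconds) on the `FrameworkInfo` passed at registration. A minimal sketch of that choice, with `FrameworkInfo` modeled as a plain dict for illustration (the `make_framework_info` helper and the framework name are hypothetical; in a real scheduler this would be the Mesos `FrameworkInfo` protobuf):

```python
# Sketch: pick a failover_timeout that upper-bounds expected scheduler downtime,
# so Mesos keeps the framework's tasks running while no scheduler is connected.

ONE_WEEK_SECONDS = 7 * 24 * 60 * 60  # 604800, Vinod's "e.g., 1 week" suggestion

def make_framework_info(name, user, failover_timeout=ONE_WEEK_SECONDS):
    """Build registration info with a long failover timeout so tasks
    survive scheduler outages and maintenance windows."""
    return {
        "name": name,
        "user": user,
        # Seconds the master waits for a scheduler failover before
        # tearing down the framework's tasks.
        "failover_timeout": float(failover_timeout),
    }

info = make_framework_info("my-ha-framework", "mesos")
```

The key trade-off: a timeout that is too short tears down healthy tasks during a long outage, while a very long one delays cleanup of genuinely dead frameworks.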
>
> On Fri, Apr 18, 2014 at 5:02 AM, David Greenberg wrote:
>
>> Hey Vinod,
>> The problem I'm trying to solve is writing a framework that can run on
>> our HA application cluster, so that whenever the framework's current
>> scheduler dies, another node will be elected and take over. I'm trying
>> to work through the various failure cases to understand how to implement
>> this so that it works through all the failure cases I can think of.
>>
>> It sounds like the solution that'd work best for me would be to try to
>> read the framework ID from a known location and register with that. If
>> it's not there, or if registration fails, then the framework should
>> register anew.
>>
>> This framework's state is very large and resides in a couple of
>> databases, so even if the entire set of candidates for becoming the
>> framework is down for the whole failover grace period, the framework
>> still wants to register, since its state never gets invalidated.
>>
>> Thanks,
>> David
>>
>> On Thursday, April 17, 2014, Vinod Kone wrote:
>>
>>> On Thu, Apr 17, 2014 at 2:56 PM, David Greenberg wrote:
>>>
>>>> My follow-up question is this: is there a way to tell whether I'm
>>>> outside of the timeout window? I'd like to have my framework check ZK
>>>> and determine whether it's within the framework timeout or not, so
>>>> that it can make the correct call.
>>>
>>> Hey David,
>>>
>>> Currently, the only signal you can get is by hitting the "/state.json"
>>> endpoint on the master. The framework should have been moved to
>>> 'completed_frameworks' after the failover timeout. Of course, if a
>>> master fails over, this information is lost, so you can't reliably
>>> depend on it.
>>>
>>> When the master starts storing persistent state about frameworks
>>> (likely a couple of releases away), a re-registration attempt in such
>>> a case would be denied by the master. So that could be your signal.
>>> Alternatively, with persistence, you could also more reliably depend
>>> on "/state.json" to get this info.
>>>
>>> To take a step back, what is the problem you are trying to solve?
>>>
>>> Thanks,
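Combining Vinod's "/state.json" signal with David's read-the-stored-ID approach, the re-registration decision might look like the sketch below. The `frameworks` / `completed_frameworks` lists with per-framework `id` fields follow the shape of the master's state.json; the `choose_framework_id` helper and sample IDs are hypothetical, and the thread's caveat applies: after a master failover, `completed_frameworks` is lost, so this signal is best-effort only.

```python
# Sketch: decide whether to re-register with a stored framework ID or
# register anew, using a parsed /state.json response from the master.

def choose_framework_id(stored_id, master_state):
    """Return stored_id if it looks safe to re-register with it, else None
    (meaning: register as a new framework and persist the new ID)."""
    if stored_id is None:
        # No ID stored in the known location (e.g., ZK): register anew.
        return None
    completed = {f["id"] for f in master_state.get("completed_frameworks", [])}
    if stored_id in completed:
        # Failover timeout already expired; the master considers this
        # framework completed, so reusing the ID won't work.
        return None
    return stored_id

# Example state.json fragment (shape only; values are illustrative).
state = {
    "frameworks": [{"id": "fw-123", "name": "my-ha-framework"}],
    "completed_frameworks": [{"id": "fw-old", "name": "retired"}],
}
```

With this shape, `choose_framework_id("fw-123", state)` returns the stored ID (still active), while `choose_framework_id("fw-old", state)` returns `None`, signaling a fresh registration.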