aurora-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From David Pan <david.p...@gmail.com>
Subject Re: Health Check Disabler Discussion
Date Mon, 13 Oct 2014 18:54:45 GMT
Sounds good to me.

On Mon, Oct 13, 2014 at 11:51 AM, Bill Farner <wfarner@apache.org> wrote:

> We had a discussion about this in our weekly community meeting in IRC
> today, and after some debate there was unanimous agreement to avoid all
> time control but to use the presence of the snooze file only.  Below is the
> excerpt from the discussion.  If you disagree, feel free to continue the
> discussion here.  Otherwise, i suggest the patch be updated to remove all
> time control.
>
> ## Health check snooze ##
> > [Mon Oct 13 18:21:52 2014] <wfarner>: We had a review for a new feature
> > move to a dev list discussion last week.  Does anybody believe we did not
> > achieve consensus on the approach?
> > [Mon Oct 13 18:22:00 2014] <wfarner>:
> https://reviews.apache.org/r/26383/
> > [Mon Oct 13 18:22:21 2014] <wfarner>: AURORA-795
> > [Mon Oct 13 18:22:27 2014] <zmanji>: There is a mailing list thread here:
> >
> http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201410.mbox/%3CCACGrrVnLWDU=vEVAFt_QN0iL5C8OQ7pqae-3Ge5NNH6vJg4uGg@mail.gmail.com%3E
> > [Mon Oct 13 18:22:57 2014] <zmanji>: I don’t think we have a consenus yet
> > so please voice your opinion
> > [Mon Oct 13 18:23:00 2014] <wickman>: I think the consensus was "touch a
> > snooze file, then unlink after mtime + CONSTANT_TIMEOUT"
> > [Mon Oct 13 18:23:07 2014] <wickman>: is that not correct?
> > [Mon Oct 13 18:23:14 2014] <wfarner>: wickman: that was my understanding
> > as well
> > [Mon Oct 13 18:23:42 2014] <wickman>: the other option is "touch a file,
> > and the health checker is disabled as long as that file is there."
> > [Mon Oct 13 18:23:43 2014] <kts>: I still feel that we should avoid being
> > too clever in our implementation here
> > [Mon Oct 13 18:23:44 2014] <jcohen>: yeah, it sounded to me like that’s
> > what we were coalescing on.
> > [Mon Oct 13 18:23:58 2014] <wickman>: the reason that I'm less in favor
> of
> > that approach is that it's not really a snooze -- it's a sleep, and could
> > be prone to somebody forgetting to turn it off
> > [Mon Oct 13 18:24:10 2014] <wickman>: which might be okay -- i think 99
> > times out of 100, people will be snoozing so they can get the state of a
> > wedged task
> > [Mon Oct 13 18:24:13 2014] <wickman>: at which point they will kill when
> > they're done
> > [Mon Oct 13 18:24:18 2014] <wickman>: so i think there's a reasonable
> > argument either way
> > [Mon Oct 13 18:24:24 2014] <wfarner>: yes, i'm torn
> > [Mon Oct 13 18:24:36 2014] <kts>: but we don't really know how long a
> tool
> > will take to get information about the wedged state
> > [Mon Oct 13 18:24:47 2014] <wickman>: kts: yeah, that's why #1 might be
> > more appealing
> > [Mon Oct 13 18:24:58 2014] <jcohen>: kts: in that cause they could extend
> > the snooze by using touch -m?
> > [Mon Oct 13 18:25:05 2014] <wickman>: though you could just do (while
> > true; do touch .snooze; sleep 60; done;) &
> > [Mon Oct 13 18:25:46 2014] <jcohen>: I suppose it’s a question of what’s
> > more likely (or more concerning): will someone forget to remove a snooze,
> > or forget to extend it
> > [Mon Oct 13 18:26:06 2014] <mkhutornenko>: +1 for not deleting the file.
> > Avoiding FS mutation == Less complexity == less things to go wrong
> > [Mon Oct 13 18:26:09 2014] <wickman>: i think it's important to look at
> > why you'd want to snooze in the first place
> > [Mon Oct 13 18:26:15 2014] <kts>: forget to extend means diagnostic
> > information is lost forever
> > [Mon Oct 13 18:26:18 2014] <wickman>: the only case i can think of is
> > something in a super weird state
> > [Mon Oct 13 18:26:27 2014] <wickman>: and they're almost always going to
> > kill those things in the weird state when they're done
> > [Mon Oct 13 18:26:32 2014] <wickman>: which would point to a permanent
> > snooze
> > [Mon Oct 13 18:26:38 2014] <wfarner>: that was my feeling as well
> > [Mon Oct 13 18:28:17 2014] <wfarner>: should we reverse the position on
> > this back to no time awareness at all?
> > [Mon Oct 13 18:28:23 2014] <kts>: +1
> > [Mon Oct 13 18:28:31 2014] <mkhutornenko>: +1
> > [Mon Oct 13 18:28:40 2014] <zmanji>: +1
> > [Mon Oct 13 18:28:46 2014] <wfarner>: +1
>
>
>
> -=Bill
>
> On Fri, Oct 10, 2014 at 3:32 PM, Bill Farner <wfarner@apache.org> wrote:
>
> > Ignore my first response, i think gmail drafts are out to get me.
> >
> > -=Bill
> >
> > On Fri, Oct 10, 2014 at 3:30 PM, Bill Farner <wfarner@apache.org> wrote:
> >
> >> I'm cool with #2, specifically if we do not attempt to parse the file
> and
> >> use that to determine the auto-expire time.
> >>
> >>
> >> -=Bill
> >>
> >> On Fri, Oct 10, 2014 at 2:48 PM, Joshua Cohen <jcohen@twopensource.com>
> >> wrote:
> >>
> >>> I'm in camp #2, I don't feel that it adds a significant amount of
> >>> complexity to the health check logic, and it provides a substantial
> >>> safeguard against users accidentally shooting themselves in the foot by
> >>> accidentally leaving a health check snoozed.
> >>>
> >>> On Fri, Oct 10, 2014 at 2:32 PM, Maxim Khutornenko <maxim@apache.org>
> >>> wrote:
> >>>
> >>> > +1 to the #1. Disabling health checks is like signing a waiver where
> >>> > all health check guarantees are off.
> >>> >
> >>> > On Fri, Oct 10, 2014 at 2:23 PM, David Pan <david.pan2@gmail.com>
> >>> wrote:
> >>> > > Hi Aurora,
> >>> > >
> >>> > > I am currently working on a feature that allows for health checks
> to
> >>> be
> >>> > > disabled temporarily for a running instance of a job.  The code
> >>> review
> >>> > can
> >>> > > be found at https://reviews.apache.org/r/26383/.  The idea is
that
> >>> the
> >>> > > presence of a special "snooze file" in the task's sandbox will
> >>> trigger
> >>> > the
> >>> > > disabling of the health checks.
> >>> > >
> >>> > > Currently, the code reviewers have split off into two camps:
> >>> > > 1. One set of reviewers believe that simplicity is key.  Disable
> the
> >>> > health
> >>> > > checks if the snooze file is present, enable it otherwise.
> >>> > >
> >>> > > 2. The other set of reviewers believe that there should be a snooze
> >>> > > duration.  The timer starts when the snooze file is touched. 
After
> >>> the
> >>> > > snooze duration is exhausted, the snooze file should be deleted
by
> >>> the
> >>> > > health checker, and health checks resume.  This is useful if the
> >>> process
> >>> > > that initially disabled the health checks dies unexpectedly, and
is
> >>> no
> >>> > > longer there to re-enable the health checks.
> >>> > >
> >>> > > I would like to invite anyone interested to voice your opinions
and
> >>> chime
> >>> > > in.
> >>> > >
> >>> > > Thanks,
> >>> > >
> >>> > > David Pan
> >>> >
> >>>
> >>
> >>
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message