aurora-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bill Farner <wfar...@apache.org>
Subject Re: Health Check Disabler Discussion
Date Mon, 13 Oct 2014 18:51:00 GMT
We had a discussion about this in our weekly community meeting in IRC
today, and after some debate there was unanimous agreement to avoid all
time control but to use the presence of the snooze file only.  Below is the
excerpt from the discussion.  If you disagree, feel free to continue the
discussion here.  Otherwise, i suggest the patch be updated to remove all
time control.

## Health check snooze ##
> [Mon Oct 13 18:21:52 2014] <wfarner>: We had a review for a new feature
> move to a dev list discussion last week.  Does anybody believe we did not
> achieve consensus on the approach?
> [Mon Oct 13 18:22:00 2014] <wfarner>: https://reviews.apache.org/r/26383/
> [Mon Oct 13 18:22:21 2014] <wfarner>: AURORA-795
> [Mon Oct 13 18:22:27 2014] <zmanji>: There is a mailing list thread here:
> http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201410.mbox/%3CCACGrrVnLWDU=vEVAFt_QN0iL5C8OQ7pqae-3Ge5NNH6vJg4uGg@mail.gmail.com%3E
> [Mon Oct 13 18:22:57 2014] <zmanji>: I don’t think we have a consenus yet
> so please voice your opinion
> [Mon Oct 13 18:23:00 2014] <wickman>: I think the consensus was "touch a
> snooze file, then unlink after mtime + CONSTANT_TIMEOUT"
> [Mon Oct 13 18:23:07 2014] <wickman>: is that not correct?
> [Mon Oct 13 18:23:14 2014] <wfarner>: wickman: that was my understanding
> as well
> [Mon Oct 13 18:23:42 2014] <wickman>: the other option is "touch a file,
> and the health checker is disabled as long as that file is there."
> [Mon Oct 13 18:23:43 2014] <kts>: I still feel that we should avoid being
> too clever in our implementation here
> [Mon Oct 13 18:23:44 2014] <jcohen>: yeah, it sounded to me like that’s
> what we were coalescing on.
> [Mon Oct 13 18:23:58 2014] <wickman>: the reason that I'm less in favor of
> that approach is that it's not really a snooze -- it's a sleep, and could
> be prone to somebody forgetting to turn it off
> [Mon Oct 13 18:24:10 2014] <wickman>: which might be okay -- i think 99
> times out of 100, people will be snoozing so they can get the state of a
> wedged task
> [Mon Oct 13 18:24:13 2014] <wickman>: at which point they will kill when
> they're done
> [Mon Oct 13 18:24:18 2014] <wickman>: so i think there's a reasonable
> argument either way
> [Mon Oct 13 18:24:24 2014] <wfarner>: yes, i'm torn
> [Mon Oct 13 18:24:36 2014] <kts>: but we don't really know how long a tool
> will take to get information about the wedged state
> [Mon Oct 13 18:24:47 2014] <wickman>: kts: yeah, that's why #1 might be
> more appealing
> [Mon Oct 13 18:24:58 2014] <jcohen>: kts: in that cause they could extend
> the snooze by using touch -m?
> [Mon Oct 13 18:25:05 2014] <wickman>: though you could just do (while
> true; do touch .snooze; sleep 60; done;) &
> [Mon Oct 13 18:25:46 2014] <jcohen>: I suppose it’s a question of what’s
> more likely (or more concerning): will someone forget to remove a snooze,
> or forget to extend it
> [Mon Oct 13 18:26:06 2014] <mkhutornenko>: +1 for not deleting the file.
> Avoiding FS mutation == Less complexity == less things to go wrong
> [Mon Oct 13 18:26:09 2014] <wickman>: i think it's important to look at
> why you'd want to snooze in the first place
> [Mon Oct 13 18:26:15 2014] <kts>: forget to extend means diagnostic
> information is lost forever
> [Mon Oct 13 18:26:18 2014] <wickman>: the only case i can think of is
> something in a super weird state
> [Mon Oct 13 18:26:27 2014] <wickman>: and they're almost always going to
> kill those things in the weird state when they're done
> [Mon Oct 13 18:26:32 2014] <wickman>: which would point to a permanent
> snooze
> [Mon Oct 13 18:26:38 2014] <wfarner>: that was my feeling as well
> [Mon Oct 13 18:28:17 2014] <wfarner>: should we reverse the position on
> this back to no time awareness at all?
> [Mon Oct 13 18:28:23 2014] <kts>: +1
> [Mon Oct 13 18:28:31 2014] <mkhutornenko>: +1
> [Mon Oct 13 18:28:40 2014] <zmanji>: +1
> [Mon Oct 13 18:28:46 2014] <wfarner>: +1



-=Bill

On Fri, Oct 10, 2014 at 3:32 PM, Bill Farner <wfarner@apache.org> wrote:

> Ignore my first response, i think gmail drafts are out to get me.
>
> -=Bill
>
> On Fri, Oct 10, 2014 at 3:30 PM, Bill Farner <wfarner@apache.org> wrote:
>
>> I'm cool with #2, specifically if we do not attempt to parse the file and
>> use that to determine the auto-expire time.
>>
>>
>> -=Bill
>>
>> On Fri, Oct 10, 2014 at 2:48 PM, Joshua Cohen <jcohen@twopensource.com>
>> wrote:
>>
>>> I'm in camp #2, I don't feel that it adds a significant amount of
>>> complexity to the health check logic, and it provides a substantial
>>> safeguard against users accidentally shooting themselves in the foot by
>>> accidentally leaving a health check snoozed.
>>>
>>> On Fri, Oct 10, 2014 at 2:32 PM, Maxim Khutornenko <maxim@apache.org>
>>> wrote:
>>>
>>> > +1 to the #1. Disabling health checks is like signing a waiver where
>>> > all health check guarantees are off.
>>> >
>>> > On Fri, Oct 10, 2014 at 2:23 PM, David Pan <david.pan2@gmail.com>
>>> wrote:
>>> > > Hi Aurora,
>>> > >
>>> > > I am currently working on a feature that allows for health checks to
>>> be
>>> > > disabled temporarily for a running instance of a job.  The code
>>> review
>>> > can
>>> > > be found at https://reviews.apache.org/r/26383/.  The idea is that
>>> the
>>> > > presence of a special "snooze file" in the task's sandbox will
>>> trigger
>>> > the
>>> > > disabling of the health checks.
>>> > >
>>> > > Currently, the code reviewers have split off into two camps:
>>> > > 1. One set of reviewers believe that simplicity is key.  Disable the
>>> > health
>>> > > checks if the snooze file is present, enable it otherwise.
>>> > >
>>> > > 2. The other set of reviewers believe that there should be a snooze
>>> > > duration.  The timer starts when the snooze file is touched.  After
>>> the
>>> > > snooze duration is exhausted, the snooze file should be deleted by
>>> the
>>> > > health checker, and health checks resume.  This is useful if the
>>> process
>>> > > that initially disabled the health checks dies unexpectedly, and is
>>> no
>>> > > longer there to re-enable the health checks.
>>> > >
>>> > > I would like to invite anyone interested to voice your opinions and
>>> chime
>>> > > in.
>>> > >
>>> > > Thanks,
>>> > >
>>> > > David Pan
>>> >
>>>
>>
>>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message