aurora-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Joshua Cohen <jco...@twopensource.com>
Subject Re: Review Request 26383: Health Check Disabler
Date Thu, 09 Oct 2014 21:12:47 GMT
I feel like we should err on the side of correctness here, rather than
simplicity? The dangers of someone accidentally leaving health checks
disabled indefinitely (on a service that has opted in to health checks) are
not insignificant.

On Thu, Oct 9, 2014 at 1:34 PM, Kevin Sweeney <kevints@apache.org> wrote:

>
>
> > On Oct. 9, 2014, 7:53 a.m., Bill Farner wrote:
> > > src/main/python/apache/aurora/executor/common/health_checker.py, line
> 66
> > > <
> https://reviews.apache.org/r/26383/diff/3/?file=716355#file716355line66>
> > >
> > >     FWIW i actually meant to suggest that the snooze has no concept of
> time at all.  If the file is there, don't perform health checks.  When you
> want to re-enable health checks, delete the file.  Happy to hear what
> others think about that.
> >
> > Zameer Manji wrote:
> >     Doesn't this mean user error can disable health checks forever? I
> think we should treat disabling health checking as an exceptional state
> (since the proceses has opted in to health checking) and therefore require
> user action (increasing mtime) to continue to stay in this state.
> >
> > Kevin Sweeney wrote:
> >     Presumably we can trust the user here - health checking is after all
> opt-in via {{thermos.ports[health]}}. Making it a simple on/off switch
> greatly simplifies the code on our end.
> >
> > Zameer Manji wrote:
> >     I'm very worried about introducing a feature which can allow
> unhealthy instances live forever. Furthermore this information isn't
> exposed on the Aurora UI or Observer UI so there isn't an easy way to check
> if an instance has health checking disabled or not. A user can be
> completely oblivious that health checking of their instance has been
> disabled.
> >
> > David Pan wrote:
> >     The motivation behind the health check disabler is for another
> project I am working on, which is the JVM Heap Dumper.  Long story short,
> the JVM Heap Dumper needs to disable health checks for the task that is
> being heap dumped.  Under normal circumstances, the heap dumper can
> re-enable the health checks.  However, if the heap dumper dies
> unexpectedly, the health checks will remain disabled forever if we don't
> implement time control.  We don't want the disabling and enabling of health
> checks to be a manual process.
>
> I still agree with Bill here - the simpler implementation (skip health
> check if the file exists) is easier to reason about and a cleaner API.
>
> A process running in the sandbox could implement some form of timeout
> logic itself (for example, the application could expose an endpoint that
> forks the equivalent of "touch .healthchecksnooze; sleep 60; rm
> .healthchecksnooze") but I like the idea of keeping the executor API here
> simple to explain and reason about.
>
>
> - Kevin
>
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/26383/#review55985
> -----------------------------------------------------------
>
>
> On Oct. 8, 2014, 6:56 p.m., David Pan wrote:
> >
> > -----------------------------------------------------------
> > This is an automatically generated e-mail. To reply, visit:
> > https://reviews.apache.org/r/26383/
> > -----------------------------------------------------------
> >
> > (Updated Oct. 8, 2014, 6:56 p.m.)
> >
> >
> > Review request for Aurora, Joe Smith, Brian Wickman, and Zameer Manji.
> >
> >
> > Repository: aurora
> >
> >
> > Description
> > -------
> >
> > The health check disabler allows health checks for a job to be snoozed
> temporarily by touching a snooze file in the job's sandbox.  The
> appropriate unit tests were modified/added.
> >
> > The corresponding JIRA ticket is the following:
> > https://issues.apache.org/jira/browse/AURORA-795
> >
> >
> > Diffs
> > -----
> >
> >   src/main/python/apache/aurora/executor/common/health_checker.py
> 4980411c847d12655cbb363404707ebd9f0bd163
> >   src/test/python/apache/aurora/executor/common/BUILD
> c7f7a003c865d479ba6e3cd7b5349322f884f653
> >   src/test/python/apache/aurora/executor/common/test_health_checker.py
> aa36415fa891fc523a3a376ffeca5d3cd5ceabec
> >
> > Diff: https://reviews.apache.org/r/26383/diff/
> >
> >
> > Testing
> > -------
> >
> > On vagrant in ~/aurora, I ran
> > ./pants src/test/python/apache/aurora/executor::
> >
> >
> > Thanks,
> >
> > David Pan
> >
> >
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message