Return-Path: X-Original-To: apmail-aurora-reviews-archive@minotaur.apache.org Delivered-To: apmail-aurora-reviews-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 2950E1796C for ; Fri, 10 Oct 2014 17:36:04 +0000 (UTC) Received: (qmail 89982 invoked by uid 500); 10 Oct 2014 17:36:04 -0000 Delivered-To: apmail-aurora-reviews-archive@aurora.apache.org Received: (qmail 89937 invoked by uid 500); 10 Oct 2014 17:36:03 -0000 Mailing-List: contact reviews-help@aurora.incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: reviews@aurora.incubator.apache.org Delivered-To: mailing list reviews@aurora.incubator.apache.org Received: (qmail 89926 invoked by uid 99); 10 Oct 2014 17:36:03 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 10 Oct 2014 17:36:03 +0000 X-ASF-Spam-Status: No, hits=-1997.8 required=5.0 tests=ALL_TRUSTED,HTML_MESSAGE,T_RP_MATCHES_RCVD X-Spam-Check-By: apache.org Received: from [140.211.11.3] (HELO mail.apache.org) (140.211.11.3) by apache.org (qpsmtpd/0.29) with SMTP; Fri, 10 Oct 2014 17:35:39 +0000 Received: (qmail 87954 invoked by uid 99); 10 Oct 2014 17:35:36 -0000 Received: from reviews-vm.apache.org (HELO reviews.apache.org) (140.211.11.40) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 10 Oct 2014 17:35:36 +0000 Received: from reviews.apache.org (localhost [127.0.0.1]) by reviews.apache.org (Postfix) with ESMTP id 01EEE1DDD94; Fri, 10 Oct 2014 17:35:31 +0000 (UTC) Content-Type: multipart/alternative; boundary="===============6026503573635860651==" MIME-Version: 1.0 Subject: Re: Review Request 26383: Health Check Disabler From: "David Pan" To: "Joe Smith" , "Brian Wickman" , "Zameer Manji" Cc: "Joshua Cohen" , "Bill Farner" , "Aurora" , "Kevin Sweeney" , "David Pan" Date: Fri, 10 Oct 2014 17:35:31 -0000 Message-ID: <20141010173531.24818.16897@reviews.apache.org> X-ReviewBoard-URL: https://reviews.apache.org Auto-Submitted: auto-generated Sender: "David Pan" X-ReviewGroup: Aurora X-ReviewRequest-URL: https://reviews.apache.org/r/26383/ X-Sender: "David Pan" References: <20141009145313.24817.15764@reviews.apache.org> In-Reply-To: <20141009145313.24817.15764@reviews.apache.org> Reply-To: "David Pan" X-ReviewRequest-Repository: aurora X-Virus-Checked: Checked by ClamAV on apache.org --===============6026503573635860651== MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit > On Oct. 9, 2014, 2:53 p.m., Bill Farner wrote: > > src/main/python/apache/aurora/executor/common/health_checker.py, line 66 > > > > > > FWIW i actually meant to suggest that the snooze has no concept of time at all. If the file is there, don't perform health checks. When you want to re-enable health checks, delete the file. Happy to hear what others think about that. > > Zameer Manji wrote: > Doesn't this mean user error can disable health checks forever? I think we should treat disabling health checking as an exceptional state (since the proceses has opted in to health checking) and therefore require user action (increasing mtime) to continue to stay in this state. > > Kevin Sweeney wrote: > Presumably we can trust the user here - health checking is after all opt-in via {{thermos.ports[health]}}. Making it a simple on/off switch greatly simplifies the code on our end. > > Zameer Manji wrote: > I'm very worried about introducing a feature which can allow unhealthy instances live forever. Furthermore this information isn't exposed on the Aurora UI or Observer UI so there isn't an easy way to check if an instance has health checking disabled or not. A user can be completely oblivious that health checking of their instance has been disabled. > > David Pan wrote: > The motivation behind the health check disabler is for another project I am working on, which is the JVM Heap Dumper. Long story short, the JVM Heap Dumper needs to disable health checks for the task that is being heap dumped. Under normal circumstances, the heap dumper can re-enable the health checks. However, if the heap dumper dies unexpectedly, the health checks will remain disabled forever if we don't implement time control. We don't want the disabling and enabling of health checks to be a manual process. > > Kevin Sweeney wrote: > I still agree with Bill here - the simpler implementation (skip health check if the file exists) is easier to reason about and a cleaner API. > > A process running in the sandbox could implement some form of timeout logic itself (for example, the application could expose an endpoint that forks the equivalent of "touch .healthchecksnooze; sleep 60; rm .healthchecksnooze") but I like the idea of keeping the executor API here simple to explain and reason about. > > David Pan wrote: > Overall, wouldn't it be simpler for the executor API to handle the time control, rather than requiring an identical process in every task's sandbox to handle the time control? > > Joshua Cohen wrote: > I feel like we should err on the side of correctness here, rather than simplicity? The dangers of someone accidentally leaving health checks disabled indefinitely (on a service that has opted in to health checks) are not insignificant. I feel like this discussion warrants a short meeting. I'll set something up on the calendar. - David ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/26383/#review55985 ----------------------------------------------------------- On Oct. 9, 2014, 1:56 a.m., David Pan wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/26383/ > ----------------------------------------------------------- > > (Updated Oct. 9, 2014, 1:56 a.m.) > > > Review request for Aurora, Joe Smith, Brian Wickman, and Zameer Manji. > > > Repository: aurora > > > Description > ------- > > The health check disabler allows health checks for a job to be snoozed temporarily by touching a snooze file in the job's sandbox. The appropriate unit tests were modified/added. > > The corresponding JIRA ticket is the following: > https://issues.apache.org/jira/browse/AURORA-795 > > > Diffs > ----- > > src/main/python/apache/aurora/executor/common/health_checker.py 4980411c847d12655cbb363404707ebd9f0bd163 > src/test/python/apache/aurora/executor/common/BUILD c7f7a003c865d479ba6e3cd7b5349322f884f653 > src/test/python/apache/aurora/executor/common/test_health_checker.py aa36415fa891fc523a3a376ffeca5d3cd5ceabec > > Diff: https://reviews.apache.org/r/26383/diff/ > > > Testing > ------- > > On vagrant in ~/aurora, I ran > ./pants src/test/python/apache/aurora/executor:: > > > Thanks, > > David Pan > > --===============6026503573635860651==--