Return-Path: X-Original-To: apmail-aurora-reviews-archive@minotaur.apache.org Delivered-To: apmail-aurora-reviews-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 671F917F56 for ; Thu, 9 Oct 2014 21:13:15 +0000 (UTC) Received: (qmail 62325 invoked by uid 500); 9 Oct 2014 21:13:15 -0000 Delivered-To: apmail-aurora-reviews-archive@aurora.apache.org Received: (qmail 62280 invoked by uid 500); 9 Oct 2014 21:13:15 -0000 Mailing-List: contact reviews-help@aurora.incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: reviews@aurora.incubator.apache.org Delivered-To: mailing list reviews@aurora.incubator.apache.org Received: (qmail 62248 invoked by uid 99); 9 Oct 2014 21:13:14 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 09 Oct 2014 21:13:14 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of jcohen@twopensource.com designates 209.85.192.53 as permitted sender) Received: from [209.85.192.53] (HELO mail-qg0-f53.google.com) (209.85.192.53) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 09 Oct 2014 21:12:48 +0000 Received: by mail-qg0-f53.google.com with SMTP id a108so2966396qge.26 for ; Thu, 09 Oct 2014 14:12:47 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:date :message-id:subject:from:to:cc:content-type; bh=4T6QFI+y+tuX2Bo5870eGqbPdSaA2tpyoPwmVIbBbxI=; b=k/e0SY2boik97Ymm38e6EC/TxM0gNztoSKaxgBc5vRXKz38upj+7egIxzPQX1f4uLL LqAwHQ/NRk9St7a/JaFooG77WmXaNSBDXomvHPZafl1gvn4hdQ9URTvLRHVJb/pZpP+v EQUip5Lc/nUvIONlx6LhIDBJxog2ifRIXrzgfNyjdbu8cHb7kqpwX/xoxRHQJ3HYs0dp j2gsIE143wU3XJ+MDbyaS3uTvccZx5n1AOXuJS5uSiCFQUd5C5+IGz1PJkW9eS7fhqO2 ueMB6D6Lm2Lmeyw9gD0zk/SR8MZKupxkBqdS+XJKS7rQp9xfyIdcJspzwsFcWxS9gMWm eMzg== X-Gm-Message-State: ALoCoQl0MT9IAVhPl0pDSsY2gXP0BaMRlcNOm+oRCTTfEM4bqTBTqhK02y2JontMs5XCVBJYPhV6 MIME-Version: 1.0 X-Received: by 10.224.120.193 with SMTP id e1mr1344440qar.80.1412889167584; Thu, 09 Oct 2014 14:12:47 -0700 (PDT) Received: by 10.140.96.183 with HTTP; Thu, 9 Oct 2014 14:12:47 -0700 (PDT) In-Reply-To: <20141009203431.24818.10437@reviews.apache.org> References: <20141009145313.24817.15764@reviews.apache.org> <20141009203431.24818.10437@reviews.apache.org> Date: Thu, 9 Oct 2014 14:12:47 -0700 Message-ID: Subject: Re: Review Request 26383: Health Check Disabler From: Joshua Cohen To: Aurora , Kevin Sweeney Cc: Joe Smith , Brian Wickman , Zameer Manji , Bill Farner , David Pan Content-Type: multipart/alternative; boundary=001a11c1bf52f0fa32050503e401 X-Virus-Checked: Checked by ClamAV on apache.org --001a11c1bf52f0fa32050503e401 Content-Type: text/plain; charset=UTF-8 I feel like we should err on the side of correctness here, rather than simplicity? The dangers of someone accidentally leaving health checks disabled indefinitely (on a service that has opted in to health checks) are not insignificant. On Thu, Oct 9, 2014 at 1:34 PM, Kevin Sweeney wrote: > > > > On Oct. 9, 2014, 7:53 a.m., Bill Farner wrote: > > > src/main/python/apache/aurora/executor/common/health_checker.py, line > 66 > > > < > https://reviews.apache.org/r/26383/diff/3/?file=716355#file716355line66> > > > > > > FWIW i actually meant to suggest that the snooze has no concept of > time at all. If the file is there, don't perform health checks. When you > want to re-enable health checks, delete the file. Happy to hear what > others think about that. > > > > Zameer Manji wrote: > > Doesn't this mean user error can disable health checks forever? I > think we should treat disabling health checking as an exceptional state > (since the proceses has opted in to health checking) and therefore require > user action (increasing mtime) to continue to stay in this state. > > > > Kevin Sweeney wrote: > > Presumably we can trust the user here - health checking is after all > opt-in via {{thermos.ports[health]}}. Making it a simple on/off switch > greatly simplifies the code on our end. > > > > Zameer Manji wrote: > > I'm very worried about introducing a feature which can allow > unhealthy instances live forever. Furthermore this information isn't > exposed on the Aurora UI or Observer UI so there isn't an easy way to check > if an instance has health checking disabled or not. A user can be > completely oblivious that health checking of their instance has been > disabled. > > > > David Pan wrote: > > The motivation behind the health check disabler is for another > project I am working on, which is the JVM Heap Dumper. Long story short, > the JVM Heap Dumper needs to disable health checks for the task that is > being heap dumped. Under normal circumstances, the heap dumper can > re-enable the health checks. However, if the heap dumper dies > unexpectedly, the health checks will remain disabled forever if we don't > implement time control. We don't want the disabling and enabling of health > checks to be a manual process. > > I still agree with Bill here - the simpler implementation (skip health > check if the file exists) is easier to reason about and a cleaner API. > > A process running in the sandbox could implement some form of timeout > logic itself (for example, the application could expose an endpoint that > forks the equivalent of "touch .healthchecksnooze; sleep 60; rm > .healthchecksnooze") but I like the idea of keeping the executor API here > simple to explain and reason about. > > > - Kevin > > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/26383/#review55985 > ----------------------------------------------------------- > > > On Oct. 8, 2014, 6:56 p.m., David Pan wrote: > > > > ----------------------------------------------------------- > > This is an automatically generated e-mail. To reply, visit: > > https://reviews.apache.org/r/26383/ > > ----------------------------------------------------------- > > > > (Updated Oct. 8, 2014, 6:56 p.m.) > > > > > > Review request for Aurora, Joe Smith, Brian Wickman, and Zameer Manji. > > > > > > Repository: aurora > > > > > > Description > > ------- > > > > The health check disabler allows health checks for a job to be snoozed > temporarily by touching a snooze file in the job's sandbox. The > appropriate unit tests were modified/added. > > > > The corresponding JIRA ticket is the following: > > https://issues.apache.org/jira/browse/AURORA-795 > > > > > > Diffs > > ----- > > > > src/main/python/apache/aurora/executor/common/health_checker.py > 4980411c847d12655cbb363404707ebd9f0bd163 > > src/test/python/apache/aurora/executor/common/BUILD > c7f7a003c865d479ba6e3cd7b5349322f884f653 > > src/test/python/apache/aurora/executor/common/test_health_checker.py > aa36415fa891fc523a3a376ffeca5d3cd5ceabec > > > > Diff: https://reviews.apache.org/r/26383/diff/ > > > > > > Testing > > ------- > > > > On vagrant in ~/aurora, I ran > > ./pants src/test/python/apache/aurora/executor:: > > > > > > Thanks, > > > > David Pan > > > > > > --001a11c1bf52f0fa32050503e401--