Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 62F77200C65 for ; Sat, 15 Apr 2017 00:24:28 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id 61932160BA3; Fri, 14 Apr 2017 22:24:28 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 8221D160B8C for ; Sat, 15 Apr 2017 00:24:27 +0200 (CEST) Received: (qmail 109 invoked by uid 500); 14 Apr 2017 22:24:26 -0000 Mailing-List: contact reviews-help@aurora.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: reviews@aurora.apache.org Delivered-To: mailing list reviews@aurora.apache.org Received: (qmail 99997 invoked by uid 99); 14 Apr 2017 22:24:26 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd3-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 14 Apr 2017 22:24:26 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd3-us-west.apache.org (ASF Mail Server at spamd3-us-west.apache.org) with ESMTP id 0BD80183A68; Fri, 14 Apr 2017 22:24:26 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd3-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 4.201 X-Spam-Level: **** X-Spam-Status: No, score=4.201 tagged_above=-999 required=6.31 tests=[DKIM_ADSP_CUSTOM_MED=0.001, HEADER_FROM_DIFFERENT_DOMAINS=0.001, HTML_MESSAGE=2, KAM_LAZY_DOMAIN_SECURITY=1, NML_ADSP_CUSTOM_MED=1.2, RP_MATCHES_RCVD=-0.001] autolearn=disabled Received: from mx1-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd3-us-west.apache.org [10.40.0.10]) (amavisd-new, port 10024) with ESMTP id 8G1ONkRv7wMp; Fri, 14 Apr 2017 22:24:23 +0000 (UTC) Received: from mailrelay1-us-west.apache.org (mailrelay1-us-west.apache.org [209.188.14.139]) by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTP id 1613B5FB49; Fri, 14 Apr 2017 22:24:23 +0000 (UTC) Received: from reviews.apache.org (unknown [10.41.0.12]) by mailrelay1-us-west.apache.org (ASF Mail Server at mailrelay1-us-west.apache.org) with ESMTP id 70E34E0026; Fri, 14 Apr 2017 22:24:22 +0000 (UTC) Received: from reviews-vm2.apache.org (localhost [IPv6:::1]) by reviews.apache.org (ASF Mail Server at reviews-vm2.apache.org) with ESMTP id 6292FC4040D; Fri, 14 Apr 2017 22:24:22 +0000 (UTC) Content-Type: multipart/alternative; boundary="===============5982540891159012489==" MIME-Version: 1.0 Subject: Re: Review Request 58462: Fix bug. Do not increase current_consecutive_successes if .healthchecksnooze present From: Santhosh Kumar Shanmugham To: Joshua Cohen , Zameer Manji Cc: Santhosh Kumar Shanmugham , Aurora , Vladimir Khalatyan , Aurora ReviewBot Date: Fri, 14 Apr 2017 22:24:22 -0000 Message-ID: <20170414222422.20123.19167@reviews-vm2.apache.org> X-ReviewBoard-URL: https://reviews.apache.org/ Auto-Submitted: auto-generated Sender: Santhosh Kumar Shanmugham X-ReviewGroup: Aurora X-Auto-Response-Suppress: DR, RN, OOF, AutoReply X-ReviewRequest-URL: https://reviews.apache.org/r/58462/ X-Sender: Santhosh Kumar Shanmugham References: <20170414212634.20123.59752@reviews-vm2.apache.org> In-Reply-To: <20170414212634.20123.59752@reviews-vm2.apache.org> Reply-To: Santhosh Kumar Shanmugham X-ReviewRequest-Repository: aurora archived-at: Fri, 14 Apr 2017 22:24:28 -0000 --===============5982540891159012489== MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 8bit > On April 14, 2017, 2:26 p.m., Santhosh Kumar Shanmugham wrote: > > src/main/python/apache/aurora/executor/common/health_checker.py > > Lines 163-166 (patched) > > > > > > This will cause a task to get stuck in `STARTING` since `self.running` will never be set to `True`. > > > > Can you explain the particular usecase here? Also add a test case to exercise this branch. > > Vladimir Khalatyan wrote: > The idea is to make HealthCheck process to start after some of the setup processes are finished. With the current approach it's possible to addjust the "starting" point of the HealthCheck process by changing initial_interval_secs. But it means that we rely on the timing which doesn't guarantee anything. > The idea of HealthCheck "snoozing" is ignore any status of the healthcheck unless some process tells HealthCheck to start checking the health of the service. > > Example (simplified one): > Let's assume we start two processes on the machine: the LB registration and the UWSGI process. Let's say the uwsgi process requires some time to warm up. The LB registration depends on the load on LB, how soon uwsgi warms up, etc. So the actual moment when the application becomes available can vary from couple of seconds to minutes and we can not rely on initial_interval_secs. So we create a .healthchecksnooze file and ignore all results of the healthcheck unless this file is there. In a meanwhile the LB registration process will try to register service some number of times ( < max_failures) and delete the .healthchecksnooze after it succeeds. Since this particular moment the healthcheck will start incrementing the concecutive successes or failures and we can determine whether the deployment is successfull or not. > So with this approach we can specify the "starting" point of health checking more accurately and dependent on other processes. > > Here by "starting" point of the health check I mean the checking of the application health and changing the consecutive successes or failures, not the actual system process. > > Santhosh Kumar Shanmugham wrote: > > "So the actual moment when the application becomes available can vary from couple of seconds to minutes and we can not rely on initial_interval_secs." > > The current implementation addresses this problem of `initial_interval_secs` not responding faster with varying startup times. It achieves this by performing `health checks` during the startup time (`initial_interval_secs`) but ignores all failures during this period, however successful health checks now count towards transitioning the task to a healthy (RUNNING) state. Thereby it can accomodate both slow startup as well as fast startup without making the faster startup instances from waiting until the entire `initial_interval_secs` has expired. > > However for your change in particular, you might also need to account for `_should_enforce_deadline` - which will treat a task as unhealthy if it runs out of attempts. I was just looking at the docs and your usecase of health-check snoozing is vastly different from the usecase the documentation - > You can pause health checking by touching a file inside of your sandbox, named .healthchecksnooze. As long as that file is present, health checks will be disabled, enabling users to gather core dumps or other performance measurements without worrying about Aurora’s health check killing their process. Using health check snoozing to achieve synchronization of health checks, is indicative of how inflexible the health-checking mechanism is in reality. :( - Santhosh Kumar ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/58462/#review172032 ----------------------------------------------------------- On April 14, 2017, 1:35 p.m., Vladimir Khalatyan wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/58462/ > ----------------------------------------------------------- > > (Updated April 14, 2017, 1:35 p.m.) > > > Review request for Aurora, Joshua Cohen and Zameer Manji. > > > Repository: aurora > > > Description > ------- > > Fix bug. Do not increase current_consecutive_successes if .healthchecksnooze present > > > Diffs > ----- > > src/main/python/apache/aurora/executor/common/health_checker.py e9e4129af2db5202a82e9f6d54109a00bbae97ce > > > Diff: https://reviews.apache.org/r/58462/diff/1/ > > > Testing > ------- > > The Health Check is succeeding when the .healthchecksnooze is present. But it should just snooze which means there shouldn't be any increase in consecutive successes or consecutive failures. > > > Thanks, > > Vladimir Khalatyan > > --===============5982540891159012489==--