Subject: Re: Review Request 60306: Ensure Thermos is not overloaded by an unlimited number of lost processes
From: Zameer Manji
To: Aurora, Stephan Erb, Zameer Manji
Date: Wed, 21 Jun 2017 22:09:48 -0000
Message-ID: <20170621220948.12058.86957@reviews-vm2.apache.org>
In-Reply-To: <20170621213717.39411.53371@reviews-vm2.apache.org>
X-ReviewRequest-URL: https://reviews.apache.org/r/60306/
X-ReviewRequest-Repository: aurora

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60306/#review178571
-----------------------------------------------------------

Ship it!

I once noticed this in production but I neglected to investigate further. Thanks for this fix.

src/main/python/apache/thermos/common/planner.py
Lines 307 (patched)

    There is duplication here with the failure case. It might be worthwhile to extract this logic out to a private method instead of duplicating it here.

- Zameer Manji

On June 21, 2017, 2:37 p.m., Stephan Erb wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/60306/
> -----------------------------------------------------------
>
> (Updated June 21, 2017, 2:37 p.m.)
>
> Review request for Aurora.
>
> Repository: aurora
>
> Description
> -------
>
> Included changes:
>
> * Thermos may consider launched processes to be LOST. Instead of
>   restarting those immediately, the restarts will now be at least
>   `min_duration` seconds apart. Restarts will also be capped at the
>   TOTAL_RUN_LIMIT of 100 restarts. This ensures neither Thermos nor the
>   observer will be overloaded by checkpoints. The handling of the LOST
>   state is now consistent with the handling of both FAILED and FINISHED.
> * Mark the success_transition and failure_transition as private. They
>   are only used within `TaskPlanner` itself.
> * Fix documented default of `min_duration` (i.e 5s rather than 15s).
>
> Diffs
> -----
>
>   docs/reference/configuration.md 6a9a3ff988dd2102aa9d22e27f22487f18423894
>   src/main/python/apache/thermos/common/planner.py da5120f8f0c2489927a03e9d78ccb4f9b6eb275f
>   src/test/python/apache/thermos/common/test_task_planner.py 132c1ec4977143b79df8d13804370e76a553c3b9
>
> Diff: https://reviews.apache.org/r/60306/diff/1/
>
> Testing
> -------
>
> ./pants test.pytest src/test/python::
>
> File Attachments
> ----------------
>
> massive_cpu_spike.png
>   https://reviews.apache.org/media/uploaded/files/2017/06/21/57cbc6e6-2cd5-4e92-995a-e0e05a57c359__massive_cpu_spike.png
> massive_restart_count.png
>   https://reviews.apache.org/media/uploaded/files/2017/06/21/c4cbab7c-1a48-4cf0-92ab-5fa9444813c7__massive_restart_count.png
>
> Thanks,
>
> Stephan Erb
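
For illustration only, the throttling described in the review request and the refactoring suggested in the review comment could look roughly like the sketch below. This is not the code from planner.py: the class `RestartThrottle` and all of its method names are hypothetical. Only `min_duration` (default 5s) and the TOTAL_RUN_LIMIT of 100 are taken from the description, and measuring the delay from the timestamp of the FAILED or LOST transition is an assumption.

```python
TOTAL_RUN_LIMIT = 100  # cap on runs, value taken from the review description


class RestartThrottle:
    """Hypothetical helper; NOT the actual Thermos TaskPlanner API."""

    def __init__(self, min_duration=5.0, total_run_limit=TOTAL_RUN_LIMIT):
        self.min_duration = min_duration        # seconds between restarts
        self.total_run_limit = total_run_limit  # maximum number of runs
        self.runs = 0                           # runs started so far
        self.not_before = 0.0                   # earliest allowed next start
        self.exhausted = False                  # True once the cap is hit

    def on_start(self):
        """Count a newly launched run."""
        self.runs += 1

    def on_failed(self, timestamp):
        """A run FAILED at `timestamp`; delegate to the shared logic."""
        self._schedule_rerun(timestamp)

    def on_lost(self, timestamp):
        """A run was declared LOST at `timestamp`; same handling as FAILED."""
        self._schedule_rerun(timestamp)

    def _schedule_rerun(self, timestamp):
        """Shared private method, so FAILED and LOST do not duplicate logic."""
        if self.runs >= self.total_run_limit:
            self.exhausted = True  # never restart again
        else:
            # Assumption: the delay is measured from the state transition.
            self.not_before = timestamp + self.min_duration

    def is_runnable(self, now):
        """True if a restart is allowed at time `now`."""
        return not self.exhausted and now >= self.not_before
```

With `min_duration=5.0`, a process reported LOST at t=10 would not be eligible for a restart until t=15, and once the run count reaches the limit it is never restarted again. Routing both `on_failed` and `on_lost` through `_schedule_rerun` is one way to avoid the duplication noted in the review comment.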