Content-Type: multipart/alternative; boundary="===============2399329708750423779=="
MIME-Version: 1.0
Subject: Re: Review Request 63443: Terminate the executor on unhandled errors
From: Aurora ReviewBot
To: Bill Farner, Zameer Manji
Cc: Aurora, Stephan Erb, Aurora ReviewBot
Date: Tue, 31 Oct 2017 16:36:19 -0000
Message-ID: <20171031163619.27698.40213@reviews-vm2.apache.org>
References: <20171031161725.27698.82693@reviews-vm2.apache.org>
In-Reply-To: <20171031161725.27698.82693@reviews-vm2.apache.org>
X-ReviewRequest-URL: https://reviews.apache.org/r/63443/
X-ReviewRequest-Repository: aurora
X-ReviewBoard-Diff-For: src/main/python/apache/thermos/common/excepthook.py

--===============2399329708750423779==
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63443/#review189731
-----------------------------------------------------------

Master (9e646ae) is green with this patch.
  ./build-support/jenkins/build.sh

However, it appears that it might lack test coverage.

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On Oct. 31, 2017, 4:17 p.m., Stephan Erb wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63443/
> -----------------------------------------------------------
>
> (Updated Oct. 31, 2017, 4:17 p.m.)
>
>
> Review request for Aurora, Bill Farner and Zameer Manji.
>
>
> Bugs: AURORA-1955
>     https://issues.apache.org/jira/browse/AURORA-1955
>
>
> Repository: aurora
>
>
> Description
> -------
>
> This commit consists of two independent parts:
>
> a) ensure we interrupt the main thread when there are unhandled exceptions
> b) ensure the main thread of the executor can be interrupted
>
>
> Diffs
> -----
>
>   src/main/python/apache/aurora/executor/bin/thermos_executor_main.py a191cf9eec844035c0f6aa5aed3731a06024c0df
>   src/main/python/apache/aurora/tools/thermos.py de20c06cea5bbb45c7a6f5acfeee69289f8e6ad8
>   src/main/python/apache/aurora/tools/thermos_observer.py 0318f990ac003c0b8925b7eb7359431cdee34f05
>   src/main/python/apache/thermos/common/excepthook.py PRE-CREATION
>   src/main/python/apache/thermos/runner/thermos_runner.py 847f51ed2c0e003f1325aa903bd0f0b760acb365
>
>
> Diff: https://reviews.apache.org/r/63443/diff/1/
>
>
> Testing
> -------
>
> This bug is pretty hard to reproduce and test. I therefore opted for manual
> verification and injected an exception shortly before the last statement
> of the `AuroraExecutor._shutdown` method. Without this patch, this resulted in
> hanging executors on the host. With this patch, everything is terminated as
> expected.
>
> For details of the successful run, please see the executor logs below. Please
> note that the `apport.fileutils` ImportError is due to Ubuntu messing with its
> Python installation. This is not critical.
>
> ```
> twitter.common.app debug: Initializing: apache.thermos.common.excepthook (Exception termination handler.)
> I1031 15:59:37.188621 25437 exec.cpp:162] Version: 1.2.0
> I1031 15:59:37.192201 25429 exec.cpp:237] Executor registered on agent 93259518-14f4-4956-a39c-aa615bff9a5e-S0
> Writing log files to disk in /var/lib/mesos/slaves/93259518-14f4-4956-a39c-aa615bff9a5e-S0/frameworks/7b202c2e-8796-4f27-afeb-8b76ba4b3037-0000/executors/thermos-www-data-prod-hello-0-d8d50c2f-e79b-467d-8c65-cca3cb44cf9c/runs/54a5ed51-aa9b-476f-9f75-0b42bd6dfa8d
>
> ERROR] Unhandled error in . Interrupting main thread.
> Traceback (most recent call last):
>   File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
>     self.__real_run(*args, **kw)
>   File "apache/aurora/executor/status_manager.py", line 62, in run
>   File "apache/aurora/executor/aurora_executor.py", line 236, in _shutdown
> RuntimeError: Woops!
> Exception in thread Thread-7 [TID=25450]:
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
>     self.run()
>   File "/root/.pex/install/twitter.common.decorators-0.3.7-py2-none-any.whl.b23f2874a4392741fca582d9e0528c08e0335c68/twitter.common.decorators-0.3.7-py2-none-any.whl/twitter/common/decorators/threads.py", line 115, in identified
>     return instancemethod(self, *args, **kwargs)
>   File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 130, in _excepting_run
>     sys.excepthook(*sys.exc_info())
>   File "apache/thermos/common/excepthook.py", line 41, in teardown_handler
>     self._former_hook()(exc_type, value, trace)
>   File "/usr/lib/python2.7/dist-packages/apport_python_hook.py", line 63, in apport_excepthook
>     from apport.fileutils import likely_packaged, get_recent_crashes
> ImportError: No module named apport.fileutils
>
> twitter.common.app debug: main exited with ^C
> twitter.common.app debug: Shutting application down.
> twitter.common.app debug: Running exit function for apache.thermos.common.excepthook (Exception termination handler.)
> twitter.common.app debug: Running exit function for twitter.common.log (Logging subsystem.)
> twitter.common.app debug: Finishing up module teardown.
> twitter.common.app debug: Active thread: <_MainThread(MainThread, started 139968622749504)>
> twitter.common.app debug: Active thread (daemon):
> twitter.common.app debug: Active thread (daemon): <_DummyThread(Dummy-13, started daemon 139968485705472)>
> twitter.common.app debug: Active thread (daemon):
> twitter.common.app debug: Active thread (daemon):
> twitter.common.app debug: Active thread (daemon): <_DummyThread(Dummy-3, started daemon 139968510883584)>
> twitter.common.app debug: Active thread (daemon):
> twitter.common.app debug: Exiting cleanly.
> ```
>
> Corresponding agent logs, indicating that Mesos knows about the crash on teardown:
> ```
> I1031 15:59:54.692739 1956 slave.cpp:4769] Executor 'thermos-www-data-prod-hello-0-d8d50c2f-e79b-467d-8c65-cca3cb44cf9c' of framework 7b202c2e-8796-4f27-afeb-8b76ba4b3037-0000 exited with status 130
> I1031 15:59:54.692834 1956 slave.cpp:4869] Cleaning up executor 'thermos-www-data-prod-hello-0-d8d50c2f-e79b-467d-8c65-cca3cb44cf9c' of framework 7b202c2e-8796-4f27-afeb-8b76ba4b3037-0000 at executor(1)@192.168.33.7:48931
> I1031 15:59:54.692996 1956 slave.cpp:4957] Cleaning up framework 7b202c2e-8796-4f27-afeb-8b76ba4b3037-0000
> ```
>
>
> Thanks,
>
> Stephan Erb
>
>

--===============2399329708750423779==--
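
For readers of the archive, the mechanism described in the review request (report the unhandled error through the former hook, then interrupt the main thread) might look roughly like the sketch below. It is an illustration only and not the code under review: the names `make_teardown_hook` and `install`, the injectable `interrupt` parameter, and the try/finally ordering are all assumptions made here. It also targets Python 3 (`_thread`, `threading.excepthook` from 3.8 on), whereas the logs above come from Python 2.7, where the module is named `thread`.

```python
import _thread
import sys
import threading


def make_teardown_hook(former_hook, interrupt=_thread.interrupt_main):
    """Build an excepthook that reports the error, then interrupts the main thread.

    Both parameters are hypothetical names used for this sketch only.
    """
    def teardown_hook(exc_type, value, trace):
        try:
            # Chain to the previously installed hook so the error is still reported.
            former_hook(exc_type, value, trace)
        finally:
            # Ask the interpreter to raise KeyboardInterrupt in the main thread,
            # even if the former hook itself failed.
            interrupt()
    return teardown_hook


def install():
    """Route unhandled errors from the main thread and worker threads to the hook."""
    hook = make_teardown_hook(sys.excepthook)
    sys.excepthook = hook
    # threading.excepthook (Python 3.8+) receives a single args object.
    threading.excepthook = lambda args: hook(
        args.exc_type, args.exc_value, args.exc_traceback)
```

Keeping `interrupt` injectable makes the hook testable without actually interrupting the test process. The try/finally is one way to make sure the interrupt still happens when the former hook raises, as the apport hook does in the log above.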