Date: Mon, 30 Oct 2017 22:30:01 +0000 (UTC)
From: "Mohit Jaggi (JIRA)"
To: issues@aurora.apache.org
Reply-To: dev@aurora.apache.org
Subject: [jira] [Created] (AURORA-1955) thermos should exit on irrecoverable errors to avoid zombies

Mohit Jaggi created AURORA-1955:
-----------------------------------

             Summary: thermos should exit on irrecoverable errors to avoid zombies
                 Key: AURORA-1955
                 URL: https://issues.apache.org/jira/browse/AURORA-1955
             Project: Aurora
          Issue Type: Bug
          Components: Thermos
            Reporter: Mohit Jaggi
            Assignee: Stephan Erb
             Fix For: 0.18.1

We found several zombie executors on a cluster. Thermos logs indicate that it hit system limits while trying to shut down(?). The Mesos agent is unable to get the status of this container from the Docker daemon (docker inspect fails). Shouldn't Thermos exit in such a case?

WARNING: Your kernel does not support swap limit capabilities, memory limited without swap.
twitter.common.app debug: Initializing: twitter.common.log (Logging subsystem.)
Writing log files to disk in /mnt/mesos/sandbox
I1023 19:04:32.261165 7 exec.cpp:162] Version: 1.2.0
I1023 19:04:32.264870 42 exec.cpp:237] Executor registered on agent b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S3295
Writing log files to disk in /mnt/mesos/sandbox
Traceback (most recent call last):
  File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
    self.__real_run(*args, **kw)
  File "apache/thermos/monitoring/resource.py", line 243, in run
  File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/event_muxer.py", line 79, in wait
    thread.start()
  File "/usr/lib/python2.7/threading.py", line 745, in start
    _start_new_thread(self.__bootstrap, ())
thread.error: can't start new thread
ERROR] Failed to stop health checkers:
ERROR] Traceback (most recent call last):
  File "apache/aurora/executor/aurora_executor.py", line 209, in _shutdown
    propagate_deadline(self._chained_checker.stop, timeout=self.STOP_TIMEOUT)
  File "apache/aurora/executor/aurora_executor.py", line 35, in propagate_deadline
    return deadline(*args, daemon=True, propagate=True, **kw)
  File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py", line 61, in deadline
    AnonymousThread().start()
  File "/usr/lib/python2.7/threading.py", line 745, in start
    _start_new_thread(self.__bootstrap, ())
error: can't start new thread

ERROR] Failed to stop runner:
ERROR] Traceback (most recent call last):
  File "apache/aurora/executor/aurora_executor.py", line 217, in _shutdown
    propagate_deadline(self._runner.stop, timeout=self.STOP_TIMEOUT)
  File "apache/aurora/executor/aurora_executor.py", line 35, in propagate_deadline
    return deadline(*args, daemon=True, propagate=True, **kw)
  File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py", line 61, in deadline
    AnonymousThread().start()
  File "/usr/lib/python2.7/threading.py", line 745, in start
    _start_new_thread(self.__bootstrap, ())
error: can't start new thread

Traceback (most recent call last):
  File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
    self.__real_run(*args, **kw)
  File "apache/aurora/executor/status_manager.py", line 62, in run
  File "apache/aurora/executor/aurora_executor.py", line 235, in _shutdown
  File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deferred.py", line 56, in defer
    deferred.start()
  File "/usr/lib/python2.7/threading.py", line 745, in start
    _start_new_thread(self.__bootstrap, ())
thread.error: can't start new thread

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
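
On the question raised in the report (whether Thermos should exit when it can no longer start threads), here is a minimal sketch of the fail-fast idea. It is not taken from the Thermos or Aurora code base: `start_or_die`, `EXIT_CODE_THREAD_EXHAUSTION`, and the `exit_fn` parameter are hypothetical names chosen for illustration. The sketch is written for Python 3, where a failed `Thread.start()` raises `RuntimeError`; the logs above come from Python 2.7, where the same failure surfaces as `thread.error`.

```python
import os
import sys
import threading

# Hypothetical exit code for this sketch; not a value used by Thermos or Aurora.
EXIT_CODE_THREAD_EXHAUSTION = 70


def start_or_die(thread, exit_fn=os._exit):
    """Start `thread`; if it cannot be started, terminate the whole process.

    `exit_fn` defaults to os._exit, which ends the process immediately from
    any thread. It is a parameter only so the behavior can be exercised
    without actually killing the interpreter.
    """
    try:
        thread.start()
    except RuntimeError as e:
        # Python 3 raises RuntimeError("can't start new thread") when the OS
        # refuses to create a thread. RuntimeError is also raised when a
        # thread is started twice; this sketch treats both as fatal.
        sys.stderr.write("irrecoverable error: %s; exiting\n" % e)
        sys.stderr.flush()
        exit_fn(EXIT_CODE_THREAD_EXHAUSTION)


if __name__ == "__main__":
    worker = threading.Thread(target=lambda: None)
    start_or_die(worker)
    worker.join()
```

The sketch uses `os._exit` rather than `sys.exit` because `sys.exit` only raises `SystemExit` in the calling thread. Called from a background thread, it would end that thread and leave the process alive, which is the zombie state described above. The trade-off is that `os._exit` skips cleanup handlers and does not flush buffered output, so anything that must be written out has to be flushed first.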