From: Chris Nauroth
To: "user@zookeeper.apache.org"
CC: "Gupta, Abhishek", "Hejj, Botond"
Subject: Re: Zk OOM in Critical Thread
Date: Tue, 19 May 2015 16:38:42 +0000

Hello Austin,

Yes, the ZooKeeper dev community has reacted to this by porting an existing fix from trunk to the 3.4 code line. The fix has been targeted to the upcoming 3.4.7 release. For more details, see ZOOKEEPER-602.

https://issues.apache.org/jira/browse/ZOOKEEPER-602

Additionally, we're going to recommend a best practice of killing the server on OutOfMemoryError. It sounds like you've already done this. ZOOKEEPER-2185 tracks the necessary updates to documentation and scripts.
https://issues.apache.org/jira/browse/ZOOKEEPER-2185

The documentation already advised running the server under a process supervisor, so terminating on OutOfMemoryError will cause the process supervisor to initiate a restart. Leadership will transition to another node in the cluster.

I haven't personally seen an instance of ZooKeeper throwing OutOfMemoryError for compressed class space, so I don't have any specific advice on that. Maybe others could respond if they've seen that.

--Chris Nauroth

On 5/19/15, 8:10 AM, "Miller, Austin" wrote:

>Hi all,
>
>We had an event in our prod cluster where an OOM caused a leader node to
>effectively become corrupted while the rest of the ensemble thought it
>was healthy, thus permanently degrading the ensemble to provide read-only
>service on existing sessions until a human intervened.
>
>Exceptions in Critical Threads
>============
>
>As a tactical step, we've added an OOMHandler to bounce the node.
>However, we're cognizant of the fact that other exceptions in this space
>can cause this issue again. There is also an interesting interaction
>with J8 which I will get to shortly.
>
>This article (specifically bug #1) seems to apply to this issue:
>http://arstechnica.com/information-technology/2015/05/the-discovery-of-apache-zookeepers-poison-packet/
>I haven't extensively gone through the server code in some
>time, but will again shortly. I'm wondering if this is seen as an issue
>by the zookeeper dev community and if there are plans to respond.
>
>OS: linux 64 bit
>Zk: 3.4.6
>jre: 1.8.31
>
>2015-05-10 19:11:49,882 - ERROR
>[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2281:NIOServerCnxnFactory$1@44] -
>Thread Thread[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2281,5,main] died
>
>java.lang.OutOfMemoryError: Compressed class space
>    at java.lang.ClassLoader.defineClass1(Native Method)
>    at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
>    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>    at java.net.URLClassLoader.defineClass(URLClassLoader.java:455)
>    at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
>    at java.net.URLClassLoader$1.run(URLClassLoader.java:367)
>    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>    at java.security.AccessController.doPrivileged(Native Method)
>    at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
>    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>    at org.apache.zookeeper.server.quorum.QuorumPeer.makeLeader(QuorumPeer.java:605)
>    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:798)
>
>Zookeeper and J8
>
>So while this all was occurring, the CCS space in J8 filled up. This
>space is, by default, 1G. For it to fill up feels surprising. Maybe it
>was somehow due to lots of connections occurring. This caused the OOM
>which caused the error in the leader thread. I can't imagine what ZK
>server is doing to legitimately fill this space without instrumentation
>being involved somehow. Or maybe J8 has a bug. Any ideas on this would
>be appreciated.
>
>Austin
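
For readers following the thread, here is a minimal sketch of the "terminate on OutOfMemoryError" idea discussed above, so that a process supervisor can restart the node. It is not the handler Austin describes and not ZooKeeper code: the class name HaltOnOomHandler, the injectable terminator, and the cause-depth limit are all hypothetical choices made for illustration. It assumes a plain JVM-wide uncaught exception handler is acceptable. How and when such a handler gets installed in a real server is outside the sketch, since other code may later replace the default handler.

```java
/**
 * Sketch only: HaltOnOomHandler and its injectable terminator are
 * hypothetical names for illustration, not ZooKeeper classes.
 */
final class HaltOnOomHandler implements Thread.UncaughtExceptionHandler {
    // Bound the walk up the cause chain so a cyclic chain cannot loop forever.
    private static final int MAX_CAUSE_DEPTH = 32;

    // Action that ends the process; injectable so it can be replaced in tests.
    private final Runnable terminator;

    HaltOnOomHandler(Runnable terminator) {
        this.terminator = terminator;
    }

    /** Returns true if the error, or any of its causes, is an OutOfMemoryError. */
    static boolean causedByOom(Throwable error) {
        Throwable current = error;
        for (int depth = 0; current != null && depth < MAX_CAUSE_DEPTH; depth++) {
            if (current instanceof OutOfMemoryError) {
                return true;
            }
            current = current.getCause();
        }
        return false;
    }

    @Override
    public void uncaughtException(Thread thread, Throwable error) {
        if (!causedByOom(error)) {
            // Other failures are only reported here; policy for them is a separate question.
            System.err.println("Thread " + thread.getName() + " died: " + error);
            return;
        }
        try {
            System.err.println("OutOfMemoryError in " + thread.getName() + ", terminating");
        } finally {
            // Run the terminator even if logging itself fails under memory pressure.
            terminator.run();
        }
    }

    /** Installs the handler JVM-wide; halt(1) skips shutdown hooks and exits at once. */
    static void install() {
        Thread.setDefaultUncaughtExceptionHandler(
                new HaltOnOomHandler(() -> Runtime.getRuntime().halt(1)));
    }
}
```

Calling HaltOnOomHandler.install() early in startup would be the intended use. Runtime.halt is used rather than System.exit on the assumption that shutdown hooks may not run reliably once the JVM is out of memory.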