zookeeper-user mailing list archives

From "Miller, Austin" <Austin.Mil...@morganstanley.com>
Subject Zk OOM in Critical Thread
Date Tue, 19 May 2015 15:10:31 GMT
Hi all,

We had an event in our prod cluster where an OOM caused the leader node to become effectively
corrupted while the rest of the ensemble thought it was healthy. This permanently degraded
the ensemble to read-only service on existing sessions until a human intervened.

Exceptions in Critical Threads

As a tactical step, we've added an OOMHandler to bounce the node.  However, we're aware
that other exceptions in critical threads can cause this issue again.  There is also
an interesting interaction with Java 8 (J8) which I will get to shortly.
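
For illustration, a handler of this kind could look roughly like the sketch below. It is not ZooKeeper code and not necessarily the handler described above; the class name CriticalThreadGuard and its methods are hypothetical. It treats any VirtualMachineError (which includes OutOfMemoryError) as fatal and lets the caller decide how to bounce the process.

```java
// Hypothetical sketch: class and method names are assumptions, not ZooKeeper APIs.
class CriticalThreadGuard {

    /** Returns true if the throwable, or one of its causes, is a JVM-level error. */
    static boolean isFatal(Throwable t) {
        // Bound the walk so an unusual cause cycle cannot loop forever.
        int hops = 0;
        for (Throwable c = t; c != null && hops < 32; c = c.getCause(), hops++) {
            if (c instanceof VirtualMachineError) { // includes OutOfMemoryError
                return true;
            }
        }
        return false;
    }

    /** Builds a handler that logs every uncaught error and runs onFatal for fatal ones. */
    static Thread.UncaughtExceptionHandler handler(Runnable onFatal) {
        return (thread, error) -> {
            System.err.println("Thread " + thread + " died: " + error);
            if (isFatal(error)) {
                onFatal.run();
            }
        };
    }
}
```

One possible way to install it at startup is Thread.setDefaultUncaughtExceptionHandler(CriticalThreadGuard.handler(() -> Runtime.getRuntime().halt(1))), leaving the restart to an external supervisor. halt is used in that example rather than exit because shutdown hooks may themselves fail once memory is exhausted. Note that a handler like this only covers OOM-style errors; other exceptions in critical threads would still need a separate policy.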

Bug #1 in this article seems to apply to this issue:
http://arstechnica.com/information-technology/2015/05/the-discovery-of-apache-zookeepers-poison-packet/
I haven't gone through the server code extensively in some time, but will again shortly.  I'm
wondering whether the zookeeper dev community sees this as an issue and whether there are
plans to respond.

OS: linux 64 bit
Zk: 3.4.6
jre: 1.8.31

2015-05-10 19:11:49,882 - ERROR [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2281:NIOServerCnxnFactory$1@44]
- Thread Thread[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2281,5,main] died

java.lang.OutOfMemoryError: Compressed class space
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:455)
        at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:367)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at org.apache.zookeeper.server.quorum.QuorumPeer.makeLeader(QuorumPeer.java:605)
        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:798)

Zookeeper and J8
So while all this was occurring, the compressed class space (CCS) in J8 filled up.  This space
is, by default, 1G, so it is surprising for it to fill up.  Maybe it was somehow due to lots of
connections occurring.  This caused the OOM, which caused the error in the leader thread.  I
can't imagine what the ZK server is doing to legitimately fill this space without
instrumentation being involved somehow.  Or maybe J8 has a bug.  Any ideas on this would be
appreciated.
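
One way to narrow this down might be to sample class-loading counters and the class-related memory pools from inside the JVM. The sketch below uses only the standard java.lang.management API; the class name ClassSpaceProbe is hypothetical, and the pool names it matches ("Compressed Class Space", "Metaspace") are the ones HotSpot on Java 8 normally reports, which is an assumption for other JVMs.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical diagnostic helper; not part of ZooKeeper.
class ClassSpaceProbe {

    /** Returns a text snapshot of class counts and class-related memory pools. */
    static String report() {
        StringBuilder sb = new StringBuilder();
        ClassLoadingMXBean cl = ManagementFactory.getClassLoadingMXBean();
        sb.append("loaded=").append(cl.getLoadedClassCount())
          .append(" totalLoaded=").append(cl.getTotalLoadedClassCount())
          .append(" unloaded=").append(cl.getUnloadedClassCount())
          .append('\n');

        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            // Pool names are JVM-specific; these match HotSpot on Java 8 (assumption).
            if (name.contains("Compressed Class Space") || name.contains("Metaspace")) {
                MemoryUsage usage = pool.getUsage();
                if (usage != null) { // null if the pool is no longer valid
                    sb.append(name)
                      .append(": used=").append(usage.getUsed())
                      .append(" max=").append(usage.getMax()) // -1 means undefined
                      .append('\n');
                }
            }
        }
        return sb.toString();
    }
}
```

Logging ClassSpaceProbe.report() periodically would show whether totalLoaded keeps climbing while unloaded stays flat, which would point at something generating classes continuously (instrumentation, proxies, or similar) rather than at connection volume.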


