Subject: Re: Native heap leaks?
From: Hannes Schmidt
To: user@cassandra.apache.org, Chris Burroughs, Jonathan Ellis
Date: Sat, 14 May 2011 19:46:32 -0700

Ok, so I think I found one major cause contributing to the increasing
resident size of the Cassandra process. Looking at the OpenJDK sources was
a great help in understanding the problem, but my findings also apply to
the Sun/Oracle JDK because the affected code is shared by both.

Each IncomingTcpConnection (ITC) thread handles a socket to another node.
That socket is a server socket returned from ServerSocket.accept() and as
such is implemented on top of an NIO socket channel
(sun.nio.ch.SocketAdaptor), which in turn makes use of direct byte buffers.
It obtains these buffers from sun.nio.ch.Util, which caches the three most
recently used buffers per thread. If a cached buffer isn't large enough for
a message, a new one that is will replace it. The size of the buffer is
determined by the amount of data that the application requests to be read.
ITC uses the readFully() method of DataInputStream (DIS) to read data into
a byte array allocated to hold the entire message:

    int size = socketStream.readInt();
    byte[] buffer = new byte[size];
    socketStream.readFully(buffer);

Whatever the value of 'size' turns out to be will also end up being the
size of the direct buffer allocated by the socket channel code.
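To make that concrete, here is a deliberately simplified sketch of the
per-thread caching behaviour as I understand it from the OpenJDK sources
(this is an illustration, not the actual sun.nio.ch.Util code; the real
cache keeps the three most recently used buffers, this model keeps just
one):

    import java.nio.ByteBuffer;

    // Simplified model of the JDK's per-thread temporary direct buffer
    // cache. One oversized read is enough to pin an equally oversized
    // direct buffer to the thread for as long as the thread lives.
    final class BufferCacheSketch {
        private static final ThreadLocal<ByteBuffer> CACHE =
                new ThreadLocal<ByteBuffer>();

        static ByteBuffer getTemporaryDirectBuffer(int size) {
            ByteBuffer cached = CACHE.get();
            if (cached == null || cached.capacity() < size) {
                // A 40M readFully() on this thread ends up here, and the
                // resulting 40M of native memory stays cached.
                cached = ByteBuffer.allocateDirect(size);
                CACHE.set(cached);
            }
            cached.clear();
            cached.limit(size);
            return cached;
        }
    }

Because the cache is keyed by thread, the native footprint of each ITC
thread ends up being the size of the largest message it has ever read.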
Our application uses range queries whose result sets are around 40
megabytes in size. If a range isn't hosted on the node the application
client is connected to, the result set will be fetched from another node.
When that other node has prepared the result, it sends it back
asynchronously (this took me a while to grasp), and it ends up in the
direct byte buffer that sun.nio.ch.Util caches for the ITC thread on the
original node. The thing is that the buffer holds the entire message, all
40 megs of it. ITC is rather long-lived, so the buffers simply stick
around. Our range queries cover the entire ring (we do a lot of "map
reduce"), so each node ends up with as many 40M buffers as we have nodes
in the ring, 10 in our case. That's 400M of native heap space wasted on
each node. Each ITC thread holds onto its historically largest direct
buffer, possibly for a long time.

This could be alleviated by periodically closing the connection, thereby
releasing the potentially large buffer, and letting a new thread start
with a clean slate. If all queries have large result sets, that won't
help. Another alternative is to read the message incrementally rather
than buffering it in its entirety in a byte array as ITC currently does.
A third and possibly the simplest solution is to read the message into
the byte array in chunks of, say, 1M; DIS offers
readFully(data, offset, length) for that. I have tried this solution
(sketched below) and it fixes the problem for us. I'll open an issue and
submit my patch. We have observed the issue with 0.6.12, but from looking
at ITC in trunk it seems to be affected too.

It gets worse, though: even after the ITC thread dies, the cached buffers
stick around because they are held via SoftReferences, and SoftReferences
are released only as a last resort to prevent an OutOfMemoryError. Using
SoftReferences for caching direct buffers is silly because direct buffers
have a negligible impact on the Java heap but may have a dramatic impact
on the native heap. I am not the only one who thinks so [1]. In other
words, sun.nio.ch.Util's buffer caching is severely broken. I tried to
find a way to explicitly release soft references but haven't found
anything other than allocating an oversized array to force an
OutOfMemoryError. The only thing we can do is keep the buffer sizes small
in order to reduce the impact of the leak. My patch takes care of that. I
will post a link to the JIRA issue with the patch shortly.

[1] http://bugs.sun.com/view_bug.do?bug_id=6210541
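For reference, the chunked read looks roughly like the following (a sketch
of the approach, not the patch itself; the chunk size, class and method
names here are just illustrative):

    import java.io.DataInputStream;
    import java.io.IOException;

    final class ChunkedReadSketch {
        // 1M chunks keep the per-thread direct buffer cached by the
        // socket channel small, instead of letting it grow to the size
        // of the largest message ever received on the connection.
        private static final int CHUNK_SIZE = 1024 * 1024;

        static byte[] readMessage(DataInputStream socketStream)
                throws IOException {
            int size = socketStream.readInt();
            byte[] buffer = new byte[size];
            for (int offset = 0; offset < size; offset += CHUNK_SIZE) {
                int length = Math.min(CHUNK_SIZE, size - offset);
                socketStream.readFully(buffer, offset, length);
            }
            return buffer;
        }
    }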
On Wed, May 4, 2011 at 11:50 AM, Hannes Schmidt wrote:
> Hi,
>
> We are using Cassandra 0.6.12 in a cluster of 9 nodes. Each node is
> 64-bit, has 4 cores and 4G of RAM and runs on Ubuntu Lucid with the
> stock 2.6.32-31-generic kernel. We use the Sun/Oracle JDK.
>
> Here's the problem: The Cassandra process starts up with 1.1G resident
> memory (according to top) but slowly grows to 2.1G at a rate that
> seems proportional to the write load. No writes, no growth. The node
> is running other memory-sensitive applications (a second JVM for our
> in-house webapp and a short-lived C++ program) so we need to ensure
> that each process stays within certain bounds as far as memory
> requirements go. The nodes OOM and crash when the Cassandra process is
> at 2.1G so I can't say if the growth is bounded or not.
>
> Looking at the /proc/$pid/smaps for the Cassandra process it seems to
> me that it is the native heap of the Cassandra JVM that is leaking. I
> attached a readable version of the smaps file generated by [1].
>
> Some more data: Cassandra runs with default command line arguments,
> which means it gets 1G heap. The JNA jar is present and Cassandra logs
> that the memory locking was successful. In storage-conf.xml,
> DiskAccessMode is mmap_index_only. Other than that and some increased
> timeouts we left the defaults. Swap is completely disabled. I don't
> think this is related but I am mentioning it anyways: overcommit [2]
> is always on (vm.overcommit_memory=1). Without that we get OOMs when
> our application JVM is fork()'ing and exec()'ing our C++ program even
> though there is enough free RAM to satisfy the demands of the C++
> program. We think this is caused by a flawed kernel heuristic that
> assumes that the forked process (our C++ app) is as big as the forking
> one (the 2nd JVM). Anyways, the Cassandra process leaks with both
> vm.overcommit_memory=0 (the default) and 1.
>
> Whether it is the native heap that leaks or something else, I think
> that 1.1G of additional RAM for 1G of Java heap can't be normal. I'd
> be grateful for any insights or pointers at what to try next.
>
> [1] http://bmaurer.blogspot.com/2006/03/memory-usage-with-smaps.html
> [2] http://www.win.tue.nl/~aeb/linux/lk/lk-9.html#ss9.6