Subject: Re: Native heap leaks?
From: Hannes Schmidt
To: user@cassandra.apache.org, Chris Burroughs, Jonathan Ellis
Date: Sat, 14 May 2011 19:46:32 -0700

Ok, so I think I found one major cause contributing to the increasing
resident size of the Cassandra process. Looking at the OpenJDK sources was
a great help in understanding the problem, but my findings also apply to
the Sun/Oracle JDK because the affected code is shared by both.

Each IncomingTcpConnection (ITC) thread handles a socket to another node.
That socket is a server socket returned from ServerSocket.accept() and as
such is implemented on top of an NIO socket channel
(sun.nio.ch.SocketAdaptor), which in turn makes use of direct byte buffers.
It obtains these buffers from sun.nio.ch.Util, which caches the three most
recently used buffers per thread. If a cached buffer isn't large enough for
a message, a new one that is will replace it. The size of the buffer is
determined by the amount of data that the application requests to be read.
ITC uses the readFully() method of DataInputStream (DIS) to read data into
a byte array allocated to hold the entire message:

    int size = socketStream.readInt();
    byte[] buffer = new byte[size];
    socketStream.readFully(buffer);

Whatever the value of 'size' turns out to be will also end up being the
size of the direct buffer allocated by the socket channel code.
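To make that concrete, here is a deliberately simplified sketch of the
per-thread caching behaviour as I understand it from the OpenJDK sources
(this is an illustration, not the actual sun.nio.ch.Util code; the real
cache keeps the three most recently used buffers, this model keeps just
one):

    import java.nio.ByteBuffer;

    // Simplified model of the JDK's per-thread temporary direct buffer
    // cache. One oversized read is enough to pin an equally oversized
    // direct buffer to the thread for as long as the thread lives.
    final class BufferCacheSketch {
        private static final ThreadLocal<ByteBuffer> CACHE =
                new ThreadLocal<ByteBuffer>();

        static ByteBuffer getTemporaryDirectBuffer(int size) {
            ByteBuffer cached = CACHE.get();
            if (cached == null || cached.capacity() < size) {
                // A 40M readFully() on this thread ends up here, and the
                // resulting 40M of native memory stays cached.
                cached = ByteBuffer.allocateDirect(size);
                CACHE.set(cached);
            }
            cached.clear();
            cached.limit(size);
            return cached;
        }
    }

Because the cache is keyed by thread, the native footprint of each ITC
thread ends up being the size of the largest message it has ever read.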
Our application uses range queries whose result sets are around 40
megabytes in size. If a range isn't hosted on the node the application
client is connected to, the result set will be fetched from another node.
When that other node has prepared the result, it sends it back
asynchronously (this took me a while to grasp), and it ends up in the
direct byte buffer that sun.nio.ch.Util caches for the ITC thread on the
original node. The thing is that the buffer holds the entire message, all
40 megs of it. ITC is rather long-lived, so the buffers simply stick
around. Our range queries cover the entire ring (we do a lot of "map
reduce"), so each node ends up with as many 40M buffers as we have nodes
in the ring, 10 in our case. That's 400M of native heap space wasted on
each node. Each ITC thread holds onto its historically largest direct
buffer, possibly for a long time.

This could be alleviated by periodically closing the connection, thereby
releasing the potentially large buffer, and letting a new thread start
with a clean slate. If all queries have large result sets, that won't
help. Another alternative is to read the message incrementally rather
than buffering it in its entirety in a byte array as ITC currently does.
A third and possibly the simplest solution is to read the message into
the byte array in chunks of, say, 1M; DIS offers
readFully(data, offset, length) for that. I have tried this solution
(sketched below) and it fixes the problem for us. I'll open an issue and
submit my patch. We have observed the issue with 0.6.12, but from looking
at ITC in trunk it seems to be affected too.

It gets worse, though: even after the ITC thread dies, the cached buffers
stick around because they are held via SoftReferences, and SoftReferences
are released only as a last resort to prevent an OutOfMemoryError. Using
SoftReferences for caching direct buffers is silly because direct buffers
have a negligible impact on the Java heap but may have a dramatic impact
on the native heap. I am not the only one who thinks so [1]. In other
words, sun.nio.ch.Util's buffer caching is severely broken. I tried to
find a way to explicitly release soft references but haven't found
anything other than allocating an oversized array to force an
OutOfMemoryError. The only thing we can do is keep the buffer sizes small
in order to reduce the impact of the leak. My patch takes care of that. I
will post a link to the JIRA issue with the patch shortly.

[1] http://bugs.sun.com/view_bug.do?bug_id=6210541
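For reference, the chunked read looks roughly like the following (a sketch
of the approach, not the patch itself; the chunk size, class and method
names here are just illustrative):

    import java.io.DataInputStream;
    import java.io.IOException;

    final class ChunkedReadSketch {
        // 1M chunks keep the per-thread direct buffer cached by the
        // socket channel small, instead of letting it grow to the size
        // of the largest message ever received on the connection.
        private static final int CHUNK_SIZE = 1024 * 1024;

        static byte[] readMessage(DataInputStream socketStream)
                throws IOException {
            int size = socketStream.readInt();
            byte[] buffer = new byte[size];
            for (int offset = 0; offset < size; offset += CHUNK_SIZE) {
                int length = Math.min(CHUNK_SIZE, size - offset);
                socketStream.readFully(buffer, offset, length);
            }
            return buffer;
        }
    }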
On Wed, May 4, 2011 at 11:50 AM, Hannes Schmidt wrote:
> Hi,
>
> We are using Cassandra 0.6.12 in a cluster of 9 nodes. Each node is
> 64-bit, has 4 cores and 4G of RAM and runs on Ubuntu Lucid with the
> stock 2.6.32-31-generic kernel. We use the Sun/Oracle JDK.
>
> Here's the problem: The Cassandra process starts up with 1.1G resident
> memory (according to top) but slowly grows to 2.1G at a rate that
> seems proportional to the write load. No writes, no growth. The node
> is running other memory-sensitive applications (a second JVM for our
> in-house webapp and a short-lived C++ program) so we need to ensure
> that each process stays within certain bounds as far as memory
> requirements go. The nodes OOM and crash when the Cassandra process is
> at 2.1G so I can't say if the growth is bounded or not.
>
> Looking at the /proc/$pid/smaps for the Cassandra process it seems to
> me that it is the native heap of the Cassandra JVM that is leaking. I
> attached a readable version of the smaps file generated by [1].
>
> Some more data: Cassandra runs with default command line arguments,
> which means it gets 1G heap. The JNA jar is present and Cassandra logs
> that the memory locking was successful. In storage-conf.xml,
> DiskAccessMode is mmap_index_only. Other than that and some increased
> timeouts we left the defaults. Swap is completely disabled. I don't
> think this is related but I am mentioning it anyways: overcommit [2]
> is always on (vm.overcommit_memory=1). Without that we get OOMs when
> our application JVM is fork()'ing and exec()'ing our C++ program even
> though there is enough free RAM to satisfy the demands of the C++
> program. We think this is caused by a flawed kernel heuristic that
> assumes that the forked process (our C++ app) is as big as the forking
> one (the 2nd JVM). Anyways, the Cassandra process leaks with both
> vm.overcommit_memory=0 (the default) and 1.
>
> Whether it is the native heap that leaks or something else, I think
> that 1.1G of additional RAM for 1G of Java heap can't be normal. I'd
> be grateful for any insights or pointers at what to try next.
>
> [1] http://bmaurer.blogspot.com/2006/03/memory-usage-with-smaps.html
> [2] http://www.win.tue.nl/~aeb/linux/lk/lk-9.html#ss9.6