incubator-kato-spec mailing list archives

From Andrew Hall <>
Subject Re: Kato & Native Memory
Date Fri, 26 Jun 2009 08:29:04 GMT
On Thu, Jun 25, 2009 at 5:16 PM, Steve Poole<> wrote:
> On Thu, Jun 25, 2009 at 4:54 PM, Andrew Hall <>wrote:
>> Many users who hit a native OOM have no idea what to do next. There's
>> very little visibility into the native heap and often the only thing
>> they can do is raise a support request on one of the software vendors
>> in the stack.
> Um, are you saying that you want to expose this information just so
> customers can beat you up about it? :-)

At the moment the immediate assumption is often that it's the JVM's
fault. I very much doubt we'll increase the number of OOMs being
reported by exposing this data - at worst it will stay the same. We
may see more queries of the form "you are using X MB for system Y, and
we think that's too much" - but I'd much rather have that discussion.

> Let's assume then that the stats data was made available - from a customer
> perspective it would look something like this?
> threads   2000 entries,  100MB storage
> classloaders  50000 entries 2500MB storage
> other stuff  1000000 entries  1000MB storage

Essentially yes - although it makes more sense to me as a tree view:

SDK - 1000 allocations, 100MB
 |--Class libraries - 300 allocations, 30MB
 |      |
 |      --Direct Byte Buffers - 100 allocations, 27MB
 |--VM - 700 allocations, 70MB

etc. etc.
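A runtime could gather such a breakdown by charging every native allocation to a component id. A minimal sketch of that bookkeeping, assuming a simple malloc wrapper (all names here are invented for illustration, not Kato or JVM API):

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical component ids - a real breakdown would mirror the
   runtime's own subsystems (class libraries, VM, JIT, ...). */
enum component { COMP_CLASSLIB, COMP_VM, COMP_COUNT };

struct mem_stats {
    size_t allocations;
    size_t bytes;
};

static struct mem_stats stats[COMP_COUNT];

/* malloc wrapper that charges each successful allocation to a component */
static void *component_malloc(size_t n, enum component c)
{
    void *p = malloc(n);
    if (p != NULL) {
        stats[c].allocations += 1;
        stats[c].bytes += n;
    }
    return p;
}

/* Print the per-component totals - the flat form of the tree above */
static void dump_stats(void)
{
    static const char *names[COMP_COUNT] = { "Class libraries", "VM" };
    for (int c = 0; c < COMP_COUNT; c++)
        printf("%s: %zu allocations, %zu bytes\n",
               names[c], stats[c].allocations, stats[c].bytes);
}
```

Nesting the components (SDK containing class libraries containing Direct Byte Buffers) would turn the flat table into the tree view.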

>  I really want to understand what happens when a native memory problem is
> encountered - from how a customer first realises that it's happened right
> through to how a JVM/JIT developer figures out the underlying bug.


Going native OOM will mean a call to malloc() returning NULL in some
native code somewhere in the Java process. The code that called malloc
will then do one of five things:

* Crash with a null pointer dereference because it didn't check the
return code from malloc. The bug gets raised with whoever owns the code
that crashed.
* Throw a java.lang.OutOfMemoryError, possibly with a useful message.
(After determining it's not Java heap - which may take a long time -
raise a defect).
* Print a message to the console (Customer raises a defect)
* Call abort() or exit(), possibly after printing a message. (Customer
raises a defect)
* Silently do nothing, in which case some future malloc() will detect
the problem.
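The well-behaved variants of the list above all start the same way: check malloc's return value and report the failure near the allocation, so the eventual defect points at what failed rather than at a crash far downstream. A sketch (checked_malloc is a hypothetical helper name, not an existing API):

```c
#include <stdio.h>
#include <stdlib.h>

/* Check malloc's result and report the failure at the point of
   allocation, instead of crashing later on a NULL dereference. */
static void *checked_malloc(size_t n, const char *what)
{
    void *p = malloc(n);
    if (p == NULL)
        fprintf(stderr,
                "native OOM: could not allocate %zu bytes for %s\n",
                n, what);
    return p;
}
```

Inside a JVM, this is typically the point where a java.lang.OutOfMemoryError with a useful message would be constructed and thrown back to Java.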

The only situation that's very different is on 64 bit platforms, where
customers either spot massive process sizes or run out of swap space;
the machine becomes unusable or the Java process is killed by the OS.
That would often cause a defect to be raised.

The support engineer will fairly quickly conclude there's a native OOM
from the supplied documentation (obvious from a core file).

The first stage is normally to rerun the scenario with some kind of
process monitoring to get the native memory usage profile. This will
either show a leak/growth that needs to be investigated, or will show
a footprint issue - starting the JVM with settings that nearly exhaust
the process address space, such that a few JIT compiles or a few new
threads push it into OOM.
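On Linux, one crude form of that monitoring is sampling the process's resident set size from /proc/self/status around the workload. A Linux-only sketch (other platforms need ps, pmap, perfmon, or similar):

```c
#include <stdio.h>
#include <string.h>

/* Sample this process's resident set size in kB from /proc/self/status.
   Linux-specific; returns -1 on error or where procfs is absent.
   Polled over time, a steady climb suggests a leak/growth, while a
   large-but-flat value suggests a footprint issue. */
static long sample_rss_kb(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    long kb = -1;

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, "VmRSS:", 6) == 0) {
            sscanf(line + 6, "%ld", &kb);
            break;
        }
    }
    fclose(f);
    return kb;
}
```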

Assuming you have a native leak or growth the next step is to figure
out what is growing.

If the JVM has a measure of its native memory footprint, at this point
you can say whether you're looking inside the JDK or outside it for
the leak. It can also show if the growth is caused by what some people
call "iceberg objects" - Java heap objects that have a small footprint
on the Java heap, but a large native footprint. E.g. Direct
ByteBuffers or Type 2 JDBC drivers.
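The iceberg shape is easy to picture from the native side: the object the managed heap sees is a handful of bytes, while the native allocation it pins can be arbitrarily large. A schematic sketch (the struct and function names are invented for illustration, not how any real Direct ByteBuffer is implemented):

```c
#include <stdlib.h>

/* The "tip" of an iceberg object: the tiny handle a Java heap analyzer
   can see.  The real cost is the native region it owns. */
struct iceberg {
    void  *native_base;   /* large malloc'd region, invisible to the GC heap */
    size_t capacity;      /* e.g. tens of MB */
};

/* Allocate a handle plus its native backing store */
static struct iceberg *iceberg_new(size_t capacity)
{
    struct iceberg *h = malloc(sizeof *h);
    if (h == NULL)
        return NULL;
    h->native_base = malloc(capacity);
    if (h->native_base == NULL) {
        free(h);
        return NULL;
    }
    h->capacity = capacity;
    return h;
}
```

The handle is a couple of pointers' worth of memory while the capacity can be megabytes, which is why tooling that only weighs heap objects by their own size misses these.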

If you either don't have the JVM data, or the data shows the problem is
outside the JDK, the next step depends on tooling. If the platform has a
native heap tracker (such as UMDH on Windows), the next step is to run
with that in place to get a native stack trace for the code causing
the leak.

If you determine that the growth is caused by iceberg objects (either
from JVM information or through native tracking) then you need to use
some combination of a heap analysis tool such as MAT or JVM tracing to
identify why they are being created and why they are being retained.
The problem is then one of retained Java heap objects and is
relatively easy to solve.

If you determine the growth is due to a piece of native code not
connected to any Java object (inside the JDK or outside of it) then
whoever owns the code has to add tracing or inspect that code until the
leak is understood and fixed.

If you can't identify the leaking suspect via native tracking (because
the tool doesn't exist on the platform or because it's not working)
then you're left with:
* Shortening the list of possible leakers by removing a native library
at a time or stubbing out bits of function
* Recompiling or relinking each library with a memory debugging library
* Analyzing the leaked memory itself to see if you can identify who
allocated it by its contents
* Extracting the malloc chains and looking for patterns in the
allocation size and allocating thread (if you have it) to identify who
the allocator was
* Engaging with the operating system vendor to see if they have any better tools
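The "patterns in allocation size" route can be mechanized: interpose on malloc and keep a histogram of allocation sizes, then look for the bucket that climbs without bound. A minimal sketch, assuming allocating code can be pointed at a wrapper (real tools hook malloc via the linker or LD_PRELOAD instead):

```c
#include <stdlib.h>

#define BUCKETS 32

/* Allocation count per power-of-two size bucket.  A leak typically
   shows up as one bucket climbing steadily while the rest stay flat. */
static long histogram[BUCKETS];

/* floor(log2(n)), clamped to the table size; bucket 0 holds sizes 0-1 */
static int bucket_for(size_t n)
{
    int b = 0;
    while ((n >>= 1) != 0 && b < BUCKETS - 1)
        b++;
    return b;
}

/* Debug wrapper: allocate and tally the requested size */
static void *traced_malloc(size_t n)
{
    void *p = malloc(n);
    if (p != NULL)
        histogram[bucket_for(n)]++;
    return p;
}
```

Recording the allocating thread id alongside the size, where available, narrows the suspect list further.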

If you get to this point, the chances of solving the issue are much reduced.

Having memory counters in the runtime itself gives you quite a lot:
* It exposes details of some iceberg objects (those in the SDK class
libs) to users, many of whom are happy to use a heap analyzer to
debug, even if they don't understand native memory.
* It makes the early stages of defect triage easier, reducing the
number of reruns required to solve the issue
* Some users have questions like "I asked for a 1GB heap, but my Java
runtime is taking up 1.5GB of physical memory. Why is that?"
* When trying to cram the biggest possible application into a 32
bit address space (because 32 bit VMs are typically faster than 64
bit), it lets you plan footprint better.

>> <snip>
