brooklyn-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <>
Subject [jira] [Commented] (BROOKLYN-375) Brooklyn intermittently uses high CPU levels and becomes unresponsive
Date Mon, 14 Nov 2016 09:50:59 GMT


ASF GitHub Bot commented on BROOKLYN-375:

Github user aledsage commented on a diff in the pull request:
    --- Diff: guide/ops/troubleshooting/ ---
    @@ -0,0 +1,138 @@
    +---
    +layout: website-normal
    +title: "Troubleshooting: Monitoring Memory Usage"
    +toc: /guide/toc.json
    +---
    +## Memory Usage
    +Brooklyn tries to keep in memory as much history of its activity as possible,
    +for display through the UI, so it is normal for it to consume as much memory
    +as it can.  It uses "soft references", so these objects will be cleared if needed,
    +but **it is not a sign of anything unusual if Brooklyn is using all its available memory**.
    +The number of active tasks, CPU usage, thread counts, and 
    +retention of soft reference objects are a much better indication of load.
    +This information can be found by looking in the log for lines containing
    +`brooklyn gc`, such as:
    +    2016-09-16 16:19:43,337 DEBUG o.a.b.c.m.i.BrooklynGarbageCollector [brooklyn-gc]: brooklyn gc (before) - using 910 MB / 3.76 GB memory; 98% soft-reference maybe retention (of 362); 35 threads; tasks: 0 active, 2 unfinished; 31 remembered, 1013 total submitted)
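    +For example, a quick way to pull these lines out of the debug log (a sketch; the
    +log file name is an assumption, so adjust it to your logging configuration):
    +
    +    grep "brooklyn gc" brooklyn.debug.log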
    +The soft-reference figure is indicative only: the lower it is, the more
    +the JVM has had to discard items that it was desirable, but optional, to keep.
    +It only tracks some soft references (those wrapped in `Maybe`),
    +and of course if there are very many such items the JVM will have to discard
    +some, so a lower figure does not necessarily mean a problem.
    +Typically however if there's no `OutOfMemoryError` (OOME) reported,
    +there's no problem.
    +## Problem Indicators and Resolutions
    +Two things that *do* normally indicate a problem with memory are:
    +* `OutOfMemoryError` exceptions being thrown
    +* Memory usage high *and* CPU high, where the CPU is spent doing full garbage collection
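    +The second indicator can be checked with the standard JDK `jstat` tool; as a
    +sketch (the PID is illustrative), if the FGC (full GC count) column climbs rapidly
    +while memory stays high, the JVM is spending its time in full collections:
    +
    +    # sample GC utilization every 1000 ms; watch the FGC and FGCT columns
    +    jstat -gcutil 12345 1000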
    +One possible cause is a poorly-chosen GC strategy in the JVM,
    +as described in [Oracle Java bug 6912889](
    +This can be confirmed by running the "Analyzing Soft Reference Usage" technique below;
    +memory should shrink dramatically, then increase until the problem recurs.
    +This can be fixed by passing `-XX:SoftRefLRUPolicyMSPerMB=1` to the JVM,
    +as described in [Brooklyn issue 375](
    +Other common JVM options include `-Xms256m -Xmx1g -XX:MaxPermSize=256m`
    +(depending on JVM provider and version) to set the right balance of memory allocation.
    +In some cases a larger `-Xmx` value may simply be the fix
    +(but this should not be the case unless many or large blueprints are being used).
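    +As a sketch, these options can be passed via `JAVA_OPTS` when using the `brooklyn`
    +launch script (the exact mechanism depends on how you run Brooklyn):
    +
    +    # assumes the launch script honours JAVA_OPTS; flags as discussed above
    +    export JAVA_OPTS="-Xms256m -Xmx1g -XX:MaxPermSize=256m -XX:SoftRefLRUPolicyMSPerMB=1"
    +    bin/brooklyn launch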
    +If the problem is not with soft references but with real memory usage,
    +the culprit is likely a memory leak, typically in blueprint design.
    +An early warning of this situation is the "soft-reference maybe retention" level decreasing.
    +In these situations, follow the steps as described below for "Investigating Leaks".
    +## Analyzing Soft Reference Usage
    +If you are concerned about memory usage, or are evaluating in a test environment,
    +the following method can be invoked (in the Groovy console) to force the system to
    +reclaim as much memory as possible, including *all* soft references:
    +    org.apache.brooklyn.util.javalang.MemoryUsageTracker.forceClearSoftReferences()
    +In good situations, memory usage should return to a small level.  
    +This call can be disruptive to the system, however, so use it with care.
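    +A minimal Groovy console sketch, using only standard `Runtime` calls, to compare
    +usage before and after clearing:
    +
    +    // bytes in use, shifted to MB
    +    def used = { (Runtime.runtime.totalMemory() - Runtime.runtime.freeMemory()) >> 20 }
    +    println "before: ${used()} MB used"
    +    org.apache.brooklyn.util.javalang.MemoryUsageTracker.forceClearSoftReferences()
    +    println "after:  ${used()} MB used"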
    +The above method can also be configured to run automatically when memory usage 
    +is detected to hit a certain level.  That can be useful if external policies are
    +being used to warn on high memory usage, and you want to keep some headroom.
    +Note that many JVM authorities discourage interfering with the garbage collector,
    +so use this with care and study the behaviour of the particular JVM you are using.
    +See the class `BrooklynGarbageCollector` for more information.
    +## Investigating Leaks
    +If a memory leak is found, the first place to look should be the WARN/ERROR logs.
    +Many common causes of leaks, such as runaway tasks and cyclic dependent configuration,
    +will show their own log errors prior to the memory error.
    +You should also note the task counts in the `brooklyn gc` messages described above;
    +if there is an exceptional number of tasks, or tasks are not clearing,
    +other log messages will describe what is happening, and the in-product task
    +view can help identify the issue.
    +Sometimes slow leaks can occur if blueprints do not clean up entities or locations.
    +These can be diagnosed by noting the number of files written to the persistence location,
    +if persistence is being used.  Deploying then destroying a blueprint should not leave
    +anything behind in the persistence directory.
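    +As a sketch, if persisting to a local directory (the path below is a common default,
    +but is an assumption depending on your configuration), the file count should return
    +to its baseline after a deploy-then-destroy cycle:
    +
    +    find ~/.brooklyn/brooklyn-persisted-state/data -type f | wc -l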
    +Where problems have been encountered in the past, we have resolved them and/or
    +worked to improve logging and early identification.
    +Please report any issues so that we can improve this further.
    +In many cases we can also give advice on what other log `grep` patterns can be useful.
    +### Standard Java Techniques
    +Useful standard Java techniques for tracking memory leaks include:
    +* `jstack <pid>` to see what threads are running and what they are doing
    +* `jmap -histo:live <pid>` to see what objects are using memory (see below)
    +* Memory profilers such as VisualVM or Eclipse MAT, either connected to a running system
    +  or used against a heap dump generated on an OOME
    +More information is available on [the Oracle Java web site](
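    +As a concrete sketch of the histogram technique (the PID is illustrative):
    +
    +    # show the 20 classes using the most heap; :live forces a full GC first
    +    jmap -histo:live 12345 | head -n 20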
    +Note that some of the above techniques will often include soft and weak references that
    +are irrelevant to the problem (and will be cleared on an OOME). Objects that may be cached
    +in that way include:
    +* `BasicConfigKey` (used for the web server and many blueprints)
    +* `DslComponent` and `*Task` (used for Brooklyn activities and dependent configuration)
    +* `jclouds` items including `ImageImpl` (to cache data on cloud service providers)
    +On the other hand, any of the above may also indicate a leak.
    +Taking snapshots after a `forceClearSoftReferences()` (above) invocation and comparing
    +is one technique to filter out noise.  Another is to wait until there is an OOME
    +and look just after, because that will clear all non-essential data from memory.
    +(The `forceClearSoftReferences()` actually works by triggering an OOME, in as safe 
    +a way as possible.)
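    +A minimal sketch of the snapshot-and-compare approach (the PID and file names are
    +illustrative):
    +
    +    jmap -histo:live 12345 > histo-before.txt
    +    # ...trigger forceClearSoftReferences() in the Groovy console...
    +    jmap -histo:live 12345 > histo-after.txt
    +    diff histo-before.txt histo-after.txt | head -n 40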
    +If leaked items are found, a profiler will normally let you see their content
    +and walk backwards along their references to find out why they are being retained.
    +### Summary of Techniques
    +The following sequence of techniques is a common approach to investigating and fixing memory issues:
    +* Note the log lines about `brooklyn gc`, including memory and tasks
    +* Do not assume high memory usage alone is an error, as soft reference caches are deliberate;
    +  use `forceClearSoftReferences()` to clear these
    --- End diff ---
    @ahgittin (cc @neykov) I thought we were not going to recommend using `forceClearSoftReferences()` in any kind of production environment. Can we put in a caveat here about not using it in production? I'd be extremely cautious about encouraging real users to call this until devs have been using it in anger themselves a lot.
    With the use of the (much safer) `-XX:SoftRefLRUPolicyMSPerMB=1`, I'd expect the need for calling this to be greatly reduced.

> Brooklyn intermittently uses high CPU levels and becomes unresponsive
> ---------------------------------------------------------------------
>                 Key: BROOKLYN-375
>                 URL:
>             Project: Brooklyn
>          Issue Type: Bug
>         Environment: OSX Sierra, Java 1.7
>            Reporter: Duncan Godwin
> Intermittently, whilst launching a Clocker swarm within Brooklyn, it uses high CPU levels and becomes unresponsive. This was noted when testing failover by manually stopping some nodes with `shutdown -h`.
> [jstack 1|]
> [jstack 2|]
> [jclouds.debug.log|]

This message was sent by Atlassian JIRA
