Date: Mon, 14 Nov 2016 09:50:59 +0000 (UTC)
From: "ASF GitHub Bot (JIRA)"
To: dev@brooklyn.apache.org
Reply-To: dev@brooklyn.apache.org
Subject: [jira] [Commented] (BROOKLYN-375) Brooklyn intermittently uses high CPU levels and becomes unresponsive

[ https://issues.apache.org/jira/browse/BROOKLYN-375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15663322#comment-15663322 ]

ASF GitHub Bot commented on BROOKLYN-375:
-----------------------------------------

Github user aledsage commented on a diff in the
pull request: https://github.com/apache/brooklyn-docs/pull/122#discussion_r87766901

--- Diff: guide/ops/troubleshooting/memory-usage.md ---
@@ -0,0 +1,138 @@
+---
+layout: website-normal
+title: "Troubleshooting: Monitoring Memory Usage"
+toc: /guide/toc.json
+---
+
+## Memory Usage
+
+Brooklyn tries to keep in memory as much history of its activity as possible,
+for displaying through the UI, so it is normal for it to consume as much memory
+as it can. It uses "soft references" so these objects will be cleared if needed,
+but **it is not a sign of anything unusual if Brooklyn is using all its available memory**.
+
+The number of active tasks, CPU usage, thread counts, and
+retention of soft reference objects are a much better indication of load.
+This information can be found by looking in the log for lines containing
+`brooklyn gc`, such as:
+
+    2016-09-16 16:19:43,337 DEBUG o.a.b.c.m.i.BrooklynGarbageCollector [brooklyn-gc]: brooklyn gc (before) - using 910 MB / 3.76 GB memory; 98% soft-reference maybe retention (of 362); 35 threads; tasks: 0 active, 2 unfinished; 31 remembered, 1013 total submitted)
+
+The soft-reference figure is indicative, but the lower this is, the more
+the JVM has decided to get rid of items that were desired to be kept but optional.
+It only tracks some soft-references (those wrapped in `Maybe`),
+and of course if there are many many such items the JVM will have to get rid
+of some, so a lower figure does not necessarily mean a problem.
+Typically however if there's no `OutOfMemoryError` (OOME) reported,
+there's no problem.
+
+
+## Problem Indicators and Resolutions
+
+Two things that *do* normally indicate a problem with memory are:
+
+* `OutOfMemoryError` exceptions being thrown
+* Memory usage high *and* CPU high, where the CPU is spent doing full garbage collection
+
+One possible cause is the JVM doing a poorly-selected GC strategy,
+as described in [Oracle Java bug 6912889](http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6912889).
+This can be confirmed by running the "analyzing soft reference usage" technique below;
+memory should shrink dramatically then increase until the problem recurs.
+This can be fixed by passing `-XX:SoftRefLRUPolicyMSPerMB=1` to the JVM,
+as described in [Brooklyn issue 375](https://issues.apache.org/jira/browse/BROOKLYN-375).
+
+Other common JVM options include `-Xms256m -Xmx1g -XX:MaxPermSize=256m`
+(depending on JVM provider and version) to set the right balance of memory allocation.
+In some cases a larger `-Xmx` value may simply be the fix
+(but this should not be the case unless many or large blueprints are being used).
+
+If the problem is not with soft references but with real memory usage,
+the culprit is likely a memory leak, typically in blueprint design.
+An early warning of this situation is the "soft-reference maybe retention" level decreasing.
+In these situations, follow the steps as described below for "Investigating Leaks".
+
+
+## Analyzing Soft Reference Usage
+
+If you are concerned about memory usage, or doing evaluation on test environments,
+the following method (in the Groovy console) can be invoked to force the system to
+reclaim as much memory as possible, including *all* soft references:
+
+    org.apache.brooklyn.util.javalang.MemoryUsageTracker.forceClearSoftReferences()
+
+In good situations, memory usage should return to a small level.
+This call can be disruptive to the system however so use with care.
+
+The above method can also be configured to run automatically when memory usage
+is detected to hit a certain level. That can be useful if external policies are
+being used to warn on high memory usage, and you want to keep some headroom.
+Many JVM authorities discourage interfering with its garbage collector, however,
+so use with care and study the particular JVM you are using.
+See the class `BrooklynGarbageCollector` for more information.
+
+
+## Investigating Leaks
+
+If a memory leak is found, the first place to look should be the WARN/ERROR logs.
+Many common causes of leaks, including as runaway tasks and cyclic dependent configuration,
+will show their own log errors prior to the memory error.
+
+You should also note the task counts in the `brooklyn gc` messages described above,
+and if there are an exceptional number of tasks or tasks are not clearing,
+other log messages will describe what is happening, and the in-product task
+view can indicate issues.
+
+Sometimes slow leaks can occur if blueprints do not clean up entities or locations.
+These can be diagnosed by noting the number of files written to the persistence location,
+if persistence is being used. Deploying then destroying a blueprint should not leave
+anything behind in the persistence directory.
+
+Where problems have been encountered in the past, we have resolved them and/or
+worked to improve logging and early identification.
+Please report any issues so that we can improve this further.
+In many cases we can also give advice on what other log `grep` patterns can be useful.
+
+
+### Standard Java Techniques
+
+Useful standard Java techniques for tracking memory leaks include:
+
+* `jstack ` to see what tasks are running
+* `jmap -histo:live ` to see what objects are using memory (see below)
+* Memory profilers such as VisualVM or Eclipse MAT, either connected to a running system or
+  against a heap dump generated on an OOME
+
+More information is available on [the Oracle Java web site](https://docs.oracle.com/javase/7/docs/webnotes/tsg/TSG-VM/html/memleaks.html).
+
+Note that some of the above techniques will often include soft and weak references that are irrelevant
+to the problem (and will be cleared on an OOME). Objects that may be cached in that way include:
+
+* `BasicConfigKey` (used for the web server and many blueprints)
+* `DslComponent` and `*Task` (used for Brooklyn activities and dependent configuration)
+* `jclouds` items including `ImageImpl` (to cache data on cloud service providers)
+
+On the other hand any of the above may also indicate a leak.
+Taking snapshots after a `forceClearSoftReferences()` (above) invocation and comparing those
+is one technique to filter out noise. Another is to wait until there is an OOME
+and look just after, because that will clear all non-essential data from memory.
+(The `forceClearSoftReferences()` actually works by triggering an OOME, in as safe
+a way as possible.)
+
+If leaked items are found, a profiler will normally let you see their content
+and walk backwards along their references to find out why they are being retained.
+
+
+### Summary of Techniques
+
+The following sequence of techniques is a common approach to investigating and fixing memory issues:
+
+* Note the log lines about `brooklyn gc`, including memory and tasks
+* Do not assume high memory usage alone is an error, as soft reference caches are deliberate;
+  use `forceClearSoftReferences()` to clear these
--- End diff --

@ahgittin (cc @neykov) I thought we were not going to recommend using `forceClearSoftReferences()` in any kind of production environment. Can we put in a caveat here about not using it in production? I'd be extremely cautious about encouraging real users to call this until devs have been using it in anger themselves a lot.

With the use of the (much safer) `-XX:SoftRefLRUPolicyMSPerMB=1`, I'd expect the need for calling this would be greatly reduced.
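As a rough illustration of the `-XX:SoftRefLRUPolicyMSPerMB=1` suggestion, the Java sketch below checks whether a running JVM was started with that option. It is not part of Brooklyn: the class name `SoftRefPolicyCheck` and the method `configuredMsPerMb` are hypothetical names made up for this example. It only inspects the start-up arguments reported by the standard `RuntimeMXBean.getInputArguments()` call, and reporting the last occurrence when the option is repeated is a choice made for this sketch.

```java
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.OptionalLong;

/** Hypothetical helper (not a Brooklyn class): looks for -XX:SoftRefLRUPolicyMSPerMB among JVM arguments. */
final class SoftRefPolicyCheck {
    private static final String FLAG = "-XX:SoftRefLRUPolicyMSPerMB=";

    /**
     * Returns the value given for the flag, or empty if it is absent or not a number.
     * If the flag appears more than once, the last occurrence is reported.
     */
    static OptionalLong configuredMsPerMb(List<String> jvmArgs) {
        OptionalLong result = OptionalLong.empty();
        for (String arg : jvmArgs) {
            if (arg.startsWith(FLAG)) {
                try {
                    result = OptionalLong.of(Long.parseLong(arg.substring(FLAG.length())));
                } catch (NumberFormatException e) {
                    // Malformed value: treat as not configured.
                    result = OptionalLong.empty();
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Arguments passed to the JVM itself (not the arguments to main).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        OptionalLong value = configuredMsPerMb(jvmArgs);
        System.out.println(value.isPresent()
                ? "SoftRefLRUPolicyMSPerMB=" + value.getAsLong()
                : "SoftRefLRUPolicyMSPerMB not set on the command line (JVM default applies)");
    }
}
```

Started with the option on the command line, this is expected to print the configured value; without it, it reports that the JVM default applies. A check like this could be run before deciding whether anything as disruptive as `forceClearSoftReferences()` is worth considering.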

> Brooklyn intermittently uses high CPU levels and becomes unresponsive
> ---------------------------------------------------------------------
>
> Key: BROOKLYN-375
> URL: https://issues.apache.org/jira/browse/BROOKLYN-375
> Project: Brooklyn
> Issue Type: Bug
> Environment: OSX Sierra, Java 1.7
> Reporter: Duncan Godwin
>
> Intermittently whilst launching a clocker swarm within brooklyn, it uses high CPU levels and becomes unresponsive. This was noted when testing failover by manually stopping some nodes with `shutdown -h`.
> [jstack 1|https://gist.github.com/drigodwin/c5946d23ed11350f393d9ba9b80a2a2d]
> [jstack 2|https://gist.github.com/drigodwin/5619b02c0c1d53ceb0c99234d8f0dd96]
> [jclouds.debug.log|https://gist.github.com/drigodwin/365d39d216e6a56c634a5020496ef8f1]

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)