hbase-issues mailing list archives

From "Andrew Purtell (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (HBASE-3206) Detect slow GC loops of death
Date Tue, 12 Aug 2014 19:13:12 GMT

     [ https://issues.apache.org/jira/browse/HBASE-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Purtell resolved HBASE-3206.

    Resolution: Not a Problem

These days we have o.a.h.h.util.JvmPauseMonitor.

> Detect slow GC loops of death
> -----------------------------
>                 Key: HBASE-3206
>                 URL: https://issues.apache.org/jira/browse/HBASE-3206
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Jean-Daniel Cryans
> Something that has been bothering me for a while was understanding when a region server
> was being slow because of frequent, small GC pauses. I usually go into that RS's GC output,
> watch it for a while, then decide whether it's under some kind of memory pressure. Here's
> an example (grepped "Full" from the GC log):
> {noformat}
> 2010-11-08T12:03:42.460-0800: [Full GC [CMS2010-11-08T12:03:43.081-0800: [CMS-concurrent-mark: 4.381/5.819 secs] [Times: user=60.51 sys=2.54, real=5.82 secs]
> 2010-11-08T12:04:06.916-0800: [Full GC [CMS2010-11-08T12:04:07.316-0800: [CMS-concurrent-mark: 4.006/5.080 secs] [Times: user=55.16 sys=2.13, real=5.08 secs]
> 2010-11-08T12:04:32.559-0800: [Full GC [CMS2010-11-08T12:04:33.286-0800: [CMS-concurrent-mark: 4.133/5.303 secs] [Times: user=53.61 sys=2.40, real=5.30 secs]
> 2010-11-08T12:05:24.299-0800: [Full GC [CMS2010-11-08T12:05:25.397-0800: [CMS-concurrent-sweep: 1.325/1.388 secs] [Times: user=4.66 sys=0.15, real=1.38 secs]
> 2010-11-08T12:05:50.069-0800: [Full GC [CMS2010-11-08T12:05:50.240-0800: [CMS-concurrent-mark: 4.831/6.346 secs] [Times: user=69.43 sys=2.76, real=6.35 secs]
> 2010-11-08T12:06:16.146-0800: [Full GC [CMS2010-11-08T12:06:16.631-0800: [CMS-concurrent-mark: 4.942/7.010 secs] [Times: user=69.25 sys=2.69, real=7.01 secs]
> 2010-11-08T12:07:08.899-0800: [Full GC [CMS2010-11-08T12:07:10.033-0800: [CMS-concurrent-sweep: 1.197/1.202 secs] [Times: user=1.96 sys=0.04, real=1.20 secs]
> 2010-11-08T12:08:01.871-0800: [Full GC [CMS2010-11-08T12:08:01.949-0800: [CMS-concurrent-mark: 4.154/5.443 secs] [Times: user=61.11 sys=2.29, real=5.44 secs]
> 2010-11-08T12:08:53.343-0800: [Full GC [CMS2010-11-08T12:08:53.549-0800: [CMS-concurrent-mark: 4.447/5.713 secs] [Times: user=65.19 sys=2.42, real=5.72 secs]
> 2010-11-08T12:09:42.841-0800: [Full GC [CMS2010-11-08T12:09:43.664-0800: [CMS-concurrent-mark: 4.025/5.053 secs] [Times: user=51.40 sys=2.02, real=5.06 secs]
> {noformat}
> In this case, that RS's TT was down, so it was getting all the non-local maps at the end
> of the job at the same time... generating >1000% CPU usage. With scanner caching set
> to 10k, it's easy to understand why there's memory pressure, since we have all those objects
> in flight that we don't account for.
> One solution I was thinking of was to have a sleeper thread that sleeps for 1 second all
> the time and logs whenever it sees that it slept for noticeably more than 1 second. Then, say,
> the region server records that it saw a few of those within x minutes and decides to somehow
> throttle the traffic.
> What I often saw is that if this situation goes unnoticed, we end up GCing more and
> more, and in some cases I saw a region server go almost zombie for 2 hours before finally
> getting its lease expired.

This message was sent by Atlassian JIRA
