Subject: Help Tuning HBase GC Settings
From: Derek Wollenstein <derek@klout.com>
To: user@hbase.apache.org
Date: Tue, 6 Dec 2011 17:10:41 -0800
I've been working on improving GC settings for HBase RegionServer instances, and I seem to have run into a bit of a dead end. Basically, I've been trying to tune the GC and memory settings appropriately, but my server keeps reaching something like GC death. My current memory settings are:

HEAPSIZE=24576
-ea -Xmx24576m -Xms24576m
-XX:+UseConcMarkSweepGC -XX:+UseCompressedOops
-XX:NewSize=192m -XX:MaxNewSize=192m
-XX:CMSInitiatingOccupancyFraction=60
-XX:ParallelGCThreads=6
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps
-Xloggc:$HBASE_HOME/logs/gc-hbase.log

We've also set hfile.block.cache.size to 0.5, believing that increasing the space for the block cache should improve performance. I'm willing to accept that this could be causing problems; I just haven't seen it reported.

Basically, as the service is "warming up", we see reasonable GC log timings:
...
2011-12-06T10:19:14.834+0000: 782.473: [GC [1 CMS-initial-mark: 2225311K(3705244K)] 2237042K(3882204K), 0.0028630 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
2011-12-06T10:19:14.837+0000: 782.476: [CMS-concurrent-mark-start]
2011-12-06T10:19:15.523+0000: 783.162: [CMS-concurrent-mark: 0.686/0.686 secs] [Times: user=1.98 sys=0.05, real=0.69 secs]
2011-12-06T10:19:15.523+0000: 783.162: [CMS-concurrent-preclean-start]
2011-12-06T10:19:15.533+0000: 783.172: [CMS-concurrent-preclean: 0.010/0.010 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
2011-12-06T10:19:15.534+0000: 783.172: [CMS-concurrent-abortable-preclean-start]
2011-12-06T10:19:15.742+0000: 783.381: [GC 783.381: [ParNew: 168978K->9067K(176960K), 0.0076540 secs] 2394290K->2235970K(3882204K), 0.0077920 secs] [Times: user=0.05 sys=0.00, real=0.00 secs]
2011-12-06T10:19:16.351+0000: 783.990: [CMS-concurrent-abortable-preclean: 0.123/0.817 secs] [Times: user=0.87 sys=0.06, real=0.81 secs]
2011-12-06T10:19:16.352+0000: 783.990: [GC[YG occupancy: 97548 K (176960 K)]783.990: [Rescan (parallel) , 0.0153180 secs]784.006: [weak refs processing, 0.0000170 secs] [1 CMS-remark: 2226903K(3705244K)] 2324451K(3882204K), 0.0154470 secs] [Times: user=0.09 sys=0.00, real=0.02 secs]
...

This seems to continue for about 16 hours. But as we hit some critical point, we end up stuck running multiple young GCs per second.
Note that they appear to be successful (if I'm reading the logs correctly):

2011-12-06T23:49:42.132+0000: 49409.770: [GC 49409.771: [ParNew: 166477K->7992K(176960K), 0.0390800 secs] 11642960K->11486691K(19297180K), 0.0392470 secs] [Times: user=0.22 sys=0.00, real=0.04 secs]
2011-12-06T23:49:42.522+0000: 49410.161: [CMS-concurrent-abortable-preclean: 0.247/0.589 secs] [Times: user=1.04 sys=0.02, real=0.58 secs]
2011-12-06T23:49:42.523+0000: 49410.162: [GC[YG occupancy: 149306 K (176960 K)]49410.162: [Rescan (parallel) , 0.0314250 secs]49410.193: [weak refs processing, 0.0000890 secs] [1 CMS-remark: 11478698K(19120220K)] 11628005K(19297180K), 0.0316410 secs] [Times: user=0.17 sys=0.01, real=0.03 secs]
2011-12-06T23:49:42.555+0000: 49410.194: [CMS-concurrent-sweep-start]
2011-12-06T23:49:42.597+0000: 49410.236: [GC 49410.236: [ParNew: 165304K->7868K(176960K), 0.0405890 secs] 11643677K->11487303K(19297180K), 0.0407690 secs] [Times: user=0.23 sys=0.00, real=0.04 secs]
2011-12-06T23:49:43.048+0000: 49410.687: [GC 49410.687: [ParNew: 165180K->6485K(176960K), 0.0389860 secs] 11027946K->10869726K(19297180K), 0.0392000 secs] [Times: user=0.23 sys=0.00, real=0.04 secs]
2011-12-06T23:49:43.181+0000: 49410.819: [CMS-concurrent-sweep: 0.542/0.625 secs] [Times: user=1.73 sys=0.02, real=0.62 secs]
2011-12-06T23:49:43.181+0000: 49410.819: [CMS-concurrent-reset-start]
2011-12-06T23:49:43.232+0000: 49410.870: [CMS-concurrent-reset: 0.051/0.051 secs] [Times: user=0.10 sys=0.00, real=0.05 secs]
2011-12-06T23:49:43.481+0000: 49411.120: [GC 49411.120: [ParNew: 163797K->7150K(176960K), 0.0409170 secs] 10380339K->10224698K(19297180K), 0.0410870 secs] [Times: user=0.24 sys=0.00, real=0.04 secs]
2011-12-06T23:49:43.920+0000: 49411.559: [GC 49411.559: [ParNew: 164462K->8178K(176960K), 0.0394640 secs] 10382010K->10226321K(19297180K), 0.0396290 secs] [Times: user=0.22 sys=0.00, real=0.04 secs]
2011-12-06T23:49:44.355+0000: 49411.994: [GC 49411.994: [ParNew: 165490K->8303K(176960K), 0.0367330 secs] 10383633K->10227244K(19297180K), 0.0368970 secs] [Times: user=0.22 sys=0.00, real=0.03 secs]
2011-12-06T23:49:44.785+0000: 49412.424: [GC 49412.424: [ParNew: 165615K->10439K(176960K), 0.0398080 secs] 10384556K->10229598K(19297180K), 0.0399870 secs] [Times: user=0.23 sys=0.00, real=0.04 secs]
2011-12-06T23:49:45.225+0000: 49412.864: [GC 49412.864: [ParNew: 167751K->13171K(176960K), 0.0393970 secs] 10386910K->10233071K(19297180K), 0.0395730 secs] [Times: user=0.23 sys=0.00, real=0.04 secs]
...

Note that we are doing about 2 young GCs per second (each around 40 ms, so roughly 80 ms of pause time per second). I believe this situation represents an undesirable use of memory resources, but I can't find any particular reason why we go from a reasonable amount of GC to GCing constantly. One theory is that this is the result of something being promoted out of the young generation? But I'm not really sure what would be causing this pathological behavior. Any ideas or alternative places to look would be greatly appreciated.

--Derek
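P.S. In case anyone wants to reproduce the numbers I quoted, here's a rough sketch (just a throwaway helper, not part of our setup; the regex assumes the ParNew line format shown above) that tallies the ParNew collection rate and pause time per wall-clock second from a gc log:

```python
import re

# Matches a ParNew entry such as:
#   2011-12-06T23:49:43.048+0000: 49410.687: [GC 49410.687: [ParNew: 165180K->6485K(176960K),
#   0.0389860 secs] 11027946K->10869726K(19297180K), 0.0392000 secs] [Times: ...]
# Group 1 = JVM uptime in seconds, group 2 = total pause for the event.
PARNEW = re.compile(
    r"(\d+\.\d+): \[GC [\d.]+: \[ParNew: .*secs\] \d+K->\d+K\(\d+K\), ([\d.]+) secs\]"
)

def parnew_stats(lines):
    """Return (ParNew collections per second, pause seconds per second) over the log span."""
    events = []
    for line in lines:
        m = PARNEW.search(line)
        if m:
            events.append((float(m.group(1)), float(m.group(2))))
    if len(events) < 2:
        return 0.0, 0.0
    span = events[-1][0] - events[0][0]          # wall-clock window covered
    total_pause = sum(pause for _, pause in events)
    return len(events) / span, total_pause / span
```

Feeding it the window quoted above gives on the order of 2-3 collections and ~100 ms of pause per wall-clock second, i.e. the same ballpark as the ~2 YGC/second I described.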