Subject: Help Tuning HBase GC Settings
From: Derek Wollenstein <derek@klout.com>
To: user@hbase.apache.org
Date: Tue, 6 Dec 2011 17:10:41 -0800
I've been working on improving GC settings for HBase RegionServer instances, and I seem to have run into a bit of a dead end. Basically, I've been trying to tune the GC and memory settings appropriately, but my server keeps reaching something like GC death. My current memory settings are:

HEAPSIZE=24576
-ea -Xmx24576m -Xms24576m
-XX:+UseConcMarkSweepGC -XX:+UseCompressedOops
-XX:NewSize=192m -XX:MaxNewSize=192m
-XX:CMSInitiatingOccupancyFraction=60
-XX:ParallelGCThreads=6
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps
-Xloggc:$HBASE_HOME/logs/gc-hbase.log

We've also set hfile.block.cache.size to 0.5, believing that increasing the space for the block cache should improve performance. I'm willing to accept that this could be causing problems; I just haven't seen it reported.

Basically, as the service is "warming up", we see reasonable GC log timings:
...
2011-12-06T10:19:14.834+0000: 782.473: [GC [1 CMS-initial-mark: 2225311K(3705244K)] 2237042K(3882204K), 0.0028630 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
2011-12-06T10:19:14.837+0000: 782.476: [CMS-concurrent-mark-start]
2011-12-06T10:19:15.523+0000: 783.162: [CMS-concurrent-mark: 0.686/0.686 secs] [Times: user=1.98 sys=0.05, real=0.69 secs]
2011-12-06T10:19:15.523+0000: 783.162: [CMS-concurrent-preclean-start]
2011-12-06T10:19:15.533+0000: 783.172: [CMS-concurrent-preclean: 0.010/0.010 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
2011-12-06T10:19:15.534+0000: 783.172: [CMS-concurrent-abortable-preclean-start]
2011-12-06T10:19:15.742+0000: 783.381: [GC 783.381: [ParNew: 168978K->9067K(176960K), 0.0076540 secs] 2394290K->2235970K(3882204K), 0.0077920 secs] [Times: user=0.05 sys=0.00, real=0.00 secs]
2011-12-06T10:19:16.351+0000: 783.990: [CMS-concurrent-abortable-preclean: 0.123/0.817 secs] [Times: user=0.87 sys=0.06, real=0.81 secs]
2011-12-06T10:19:16.352+0000: 783.990: [GC[YG occupancy: 97548 K (176960 K)]783.990: [Rescan (parallel) , 0.0153180 secs]784.006: [weak refs processing, 0.0000170 secs] [1 CMS-remark: 2226903K(3705244K)] 2324451K(3882204K), 0.0154470 secs] [Times: user=0.09 sys=0.00, real=0.02 secs]
...

This seems to continue for about 16 hours. But as we hit some critical point, we end up stuck running multiple young GCs per second.
Note that they appear to be successful (if I'm reading the logs correctly):

2011-12-06T23:49:42.132+0000: 49409.770: [GC 49409.771: [ParNew: 166477K->7992K(176960K), 0.0390800 secs] 11642960K->11486691K(19297180K), 0.0392470 secs] [Times: user=0.22 sys=0.00, real=0.04 secs]
2011-12-06T23:49:42.522+0000: 49410.161: [CMS-concurrent-abortable-preclean: 0.247/0.589 secs] [Times: user=1.04 sys=0.02, real=0.58 secs]
2011-12-06T23:49:42.523+0000: 49410.162: [GC[YG occupancy: 149306 K (176960 K)]49410.162: [Rescan (parallel) , 0.0314250 secs]49410.193: [weak refs processing, 0.0000890 secs] [1 CMS-remark: 11478698K(19120220K)] 11628005K(19297180K), 0.0316410 secs] [Times: user=0.17 sys=0.01, real=0.03 secs]
2011-12-06T23:49:42.555+0000: 49410.194: [CMS-concurrent-sweep-start]
2011-12-06T23:49:42.597+0000: 49410.236: [GC 49410.236: [ParNew: 165304K->7868K(176960K), 0.0405890 secs] 11643677K->11487303K(19297180K), 0.0407690 secs] [Times: user=0.23 sys=0.00, real=0.04 secs]
2011-12-06T23:49:43.048+0000: 49410.687: [GC 49410.687: [ParNew: 165180K->6485K(176960K), 0.0389860 secs] 11027946K->10869726K(19297180K), 0.0392000 secs] [Times: user=0.23 sys=0.00, real=0.04 secs]
2011-12-06T23:49:43.181+0000: 49410.819: [CMS-concurrent-sweep: 0.542/0.625 secs] [Times: user=1.73 sys=0.02, real=0.62 secs]
2011-12-06T23:49:43.181+0000: 49410.819: [CMS-concurrent-reset-start]
2011-12-06T23:49:43.232+0000: 49410.870: [CMS-concurrent-reset: 0.051/0.051 secs] [Times: user=0.10 sys=0.00, real=0.05 secs]
2011-12-06T23:49:43.481+0000: 49411.120: [GC 49411.120: [ParNew: 163797K->7150K(176960K), 0.0409170 secs] 10380339K->10224698K(19297180K), 0.0410870 secs] [Times: user=0.24 sys=0.00, real=0.04 secs]
2011-12-06T23:49:43.920+0000: 49411.559: [GC 49411.559: [ParNew: 164462K->8178K(176960K), 0.0394640 secs] 10382010K->10226321K(19297180K), 0.0396290 secs] [Times: user=0.22 sys=0.00, real=0.04 secs]
2011-12-06T23:49:44.355+0000: 49411.994: [GC 49411.994: [ParNew: 165490K->8303K(176960K), 0.0367330 secs] 10383633K->10227244K(19297180K), 0.0368970 secs] [Times: user=0.22 sys=0.00, real=0.03 secs]
2011-12-06T23:49:44.785+0000: 49412.424: [GC 49412.424: [ParNew: 165615K->10439K(176960K), 0.0398080 secs] 10384556K->10229598K(19297180K), 0.0399870 secs] [Times: user=0.23 sys=0.00, real=0.04 secs]
2011-12-06T23:49:45.225+0000: 49412.864: [GC 49412.864: [ParNew: 167751K->13171K(176960K), 0.0393970 secs] 10386910K->10233071K(19297180K), 0.0395730 secs] [Times: user=0.23 sys=0.00, real=0.04 secs]
...

Note that we are doing about 2 young GCs per second (each around 40 ms, so roughly 80 ms of pause time per second). I believe this situation represents an undesirable use of memory resources, but I can't find any particular reason why we go from a reasonable amount of GC to GCing constantly. One theory is that this is the result of something being promoted out of the young generation? But I'm not really sure what would be causing this pathological behavior. Any ideas or alternative places to look would be greatly appreciated.

--Derek
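P.S. In case anyone wants to reproduce the numbers I quoted, here's a rough sketch (just a throwaway helper, not part of our setup; the regex assumes the ParNew line format shown above) that tallies the ParNew collection rate and pause time per wall-clock second from a gc log:

```python
import re

# Matches a ParNew entry such as:
#   2011-12-06T23:49:43.048+0000: 49410.687: [GC 49410.687: [ParNew: 165180K->6485K(176960K),
#   0.0389860 secs] 11027946K->10869726K(19297180K), 0.0392000 secs] [Times: ...]
# Group 1 = JVM uptime in seconds, group 2 = total pause for the event.
PARNEW = re.compile(
    r"(\d+\.\d+): \[GC [\d.]+: \[ParNew: .*secs\] \d+K->\d+K\(\d+K\), ([\d.]+) secs\]"
)

def parnew_stats(lines):
    """Return (ParNew collections per second, pause seconds per second) over the log span."""
    events = []
    for line in lines:
        m = PARNEW.search(line)
        if m:
            events.append((float(m.group(1)), float(m.group(2))))
    if len(events) < 2:
        return 0.0, 0.0
    span = events[-1][0] - events[0][0]          # wall-clock window covered
    total_pause = sum(pause for _, pause in events)
    return len(events) / span, total_pause / span
```

Feeding it the window quoted above gives on the order of 2-3 collections and ~100 ms of pause per wall-clock second, i.e. the same ballpark as the ~2 YGC/second I described.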