hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mikael Sitruk <mikael.sit...@gmail.com>
Subject Re: gc pause killing regionserver
Date Mon, 05 Mar 2012 21:38:14 GMT
Try to set the initialOccupancy to a lower value and try to allocate more
space to the new generation space.
It seems that you don't have enough place in the survivor space, and
therefore you have a promotion failure.

You also mention that the same server is frequently having this problem, is
it because of your data spread? Is this server has the same load as the
other ones?



On Mon, Mar 5, 2012 at 11:00 PM, Sandy Pratt <prattrs@adobe.com> wrote:

> What was the actual process size of the JVM as reported by top?
>
> Why use the following in your config?
>
> -XX:NewRatio=16
> -XX:MaxGCPauseMillis=100
>
> Do you really have a stringent latency target, or are you just being
> aggressive?
>
> If I'm reading your log correctly, you have about 2.5 GB of heap, right?
>  If so, your full GC times are too long by over an order of magnitude.
>  Here is a snippet from one of my servers for comparison:
>
> 2393278.112: [GC [PSYoungGen: 453884K->221141K(466048K)]
> 1716800K->1510206K(1864192K), 0.1394090 secs] [Times: user=0.55 sys=0.00,
> real=0.14 secs]
> 2393281.457: [GC [PSYoungGen: 454229K->181061K(466048K)]
> 1743294K->1529614K(1864192K), 0.1377460 secs] [Times: user=0.54 sys=0.00,
> real=0.13 secs]
> 2393281.595: [Full GC [PSYoungGen: 181061K->0K(466048K)] [ParOldGen:
> 1348552K->985640K(1398144K)] 1529614K->985640K(1864192K) [PSPermGen:
> 21228K->21225K(21504K)], 0.6138910 secs] [Times: user=2.16 sys=0.02,
> real=0.62 secs]
> 2393285.890: [GC [PSYoungGen: 233088K->33764K(466048K)]
> 1218728K->1019404K(1864192K), 0.0228820 secs] [Times: user=0.09 sys=0.00,
> real=0.02 secs]
> ...
> 2393323.370: [GC [PSYoungGen: 272556K->32976K(466048K)]
> 1581838K->1381385K(1864192K), 0.0444630 secs] [Times: user=0.18 sys=0.00,
> real=0.04 secs]
> 2393323.415: [Full GC [PSYoungGen: 32976K->0K(466048K)] [ParOldGen:
> 1348409K->975392K(1398144K)] 1381385K->975392K(1864192K) [PSPermGen:
> 21225K->21225K(21504K)], 0.5755400 secs] [Times: user=2.08 sys=0.00,
> real=0.58 secs]
> 2393326.708: [GC [PSYoungGen: 233088K->32236K(470528K)]
> 1208480K->1007628K(1868672K), 0.0222720 secs] [Times: user=0.08 sys=0.01,
> real=0.02 secs]
> 2393329.372: [GC [PSYoungGen: 269804K->30186K(466048K)]
> 1245196K->1037366K(1864192K), 0.0377520 secs] [Times: user=0.14 sys=0.00,
> real=0.04 secs]
>
> Granted, your heap is a little larger, but not that much larger.
>
> ~600 ms might be too long to wait for you, so you might not want to use
> the parallel compacting collector.  Still, it seems that even when you fall
> back to the serial collector during concurrent mode failure, your pauses
> should be more in the 3-5 second range (just guessing -- see user times
> above).
>
> It's possible that you actually are in swap due to uncollected off-heap
> memory allocations.  I doubt that even severe fragmentation on the heap
> would cause that kind of slowdown.  Remember, the JVM memory footprint is
> not only Java heap.  In fact, I have recently seen situations with both
> JBoss and HBase where heap size is a small fraction of JVM process size.
>  This seems to occur most often in lightly loaded installations.  It might
> be interesting to have a look around your processes with pmap or some tool
> like that.  In many situations, you'll probably be able to account for each
> Java heap generation allocation easily enough, then be able to identify a
> succession of other allocations that aren't part of the Java heap.  Those
> latter parts are likely to be your problem, if my guess is on the money.
>
> There's a little info about using pmap against the JVM here:
>
> http://www.kdgregory.com/index.php?page=java.outOfMemory
>
>
>
>
>
> > -----Original Message-----
> > From: Ferdy Galema [mailto:ferdy.galema@kalooga.com]
> > Sent: Saturday, March 03, 2012 05:05
> > To: user@hbase.apache.org
> > Subject: gc pause killing regionserver
> >
> > Hi,
> >
> > I'm running regionservers with 2GB heap and following tuning options:
> > -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:NewRatio=16
> > -XX:CMSInitiatingOccupancyFraction=70 -
> > XX:+UseCMSInitiatingOccupancyOnly
> > -XX:MaxGCPauseMillis=100
> >
> > A regionserver aborted (YouAreDeadException) and this was printed in the
> > gc logs (all is shown up until the abort)
> >
> > 211663.516: [GC 211663.516: [ParNew: 118715K->13184K(118912K), 0.0445390
> > secs] 1373940K->1289814K(2233472K), 0.0446420 secs] [Times: user=0.14
> > sys=0.01, real=0.05 secs]
> > 211663.686: [GC 211663.686: [ParNew: 118912K->13184K(118912K), 0.0594280
> > secs] 1395542K->1310185K(2233472K), 0.0595420 secs] [Times: user=0.15
> > sys=0.00, real=0.06 secs]
> > 211663.869: [GC 211663.869: [ParNew: 118790K->13184K(118912K), 0.0434820
> > secs] 1415792K->1331317K(2233472K), 0.0435930 secs] [Times: user=0.13
> > sys=0.01, real=0.04 secs]
> > 211667.598: [GC 211667.598: [ParNew (promotion failed):
> > 118912K->118912K(118912K), 0.0225390 secs]211667.621: [CMS:
> > 1330845K->1127914K(2114560K), 51.3610670 secs]
> > 1437045K->1127914K(2233472K), [CMS Perm : 20680K->20622K(34504K)],
> > 51.3838170 secs] [Times: user=1.82 sys=0.31, real=51.38 secs]
> > 211719.713: [GC 211719.714: [ParNew: 105723K->13184K(118912K), 0.0176130
> > secs] 1233638K->1149393K(2233472K), 0.0177230 secs] [Times: user=0.07
> > sys=0.00, real=0.02 secs]
> > 211719.851: [GC 211719.852: [ParNew: 118912K->13184K(118912K), 0.0281860
> > secs] 1255121K->1170269K(2233472K), 0.0282970 secs] [Times: user=0.10
> > sys=0.01, real=0.03 secs]
> > 211719.993: [GC 211719.993: [ParNew: 118795K->13184K(118912K), 0.0276320
> > secs] 1275880K->1191268K(2233472K), 0.0277350 secs] [Times: user=0.09
> > sys=0.00, real=0.03 secs]
> > 211720.490: [GC 211720.490: [ParNew: 118912K->13184K(118912K), 0.0624650
> > secs] 1296996K->1210640K(2233472K), 0.0625560 secs] [Times: user=0.15
> > sys=0.00, real=0.06 secs]
> > 211720.687: [GC 211720.687: [ParNew: 118702K->13184K(118912K), 0.1651750
> > secs] 1316159K->1231993K(2233472K), 0.1652660 secs] [Times: user=0.25
> > sys=0.01, real=0.17 secs]
> > 211721.038: [GC 211721.038: [ParNew: 118912K->13184K(118912K), 0.0952750
> > secs] 1337721K->1252598K(2233472K), 0.0953660 secs] [Times: user=0.15
> > sys=0.00, real=0.09 secs] Heap  par new generation  total 118912K, used
> > 86199K [0x00002aaaae1f0000, 0x00002aaab62f0000, 0x00002aaab62f0000)
> >   eden space 105728K,  69% used [0x00002aaaae1f0000, 0x00002aaab293dfa8,
> > 0x00002aaab4930000)
> >   from space 13184K, 100% used [0x00002aaab4930000, 0x00002aaab5610000,
> > 0x00002aaab5610000)
> >   to  space 13184K,  0% used [0x00002aaab5610000, 0x00002aaab5610000,
> > 0x00002aaab62f0000)
> >  concurrent mark-sweep generation total 2114560K, used 1239414K
> > [0x00002aaab62f0000, 0x00002aab373f0000, 0x00002aab373f0000)
> > concurrent-mark-sweep perm gen total 34504K, used 20728K
> > [0x00002aab373f0000, 0x00002aab395a2000, 0x00002aab3c7f0000)
> >
> >
> > Why did a GC took 51 seconds? The machine still had enough memory
> > available so it could not be swapping. (swapiness is set to 0). From the
> 15
> > regionservers in total, I often see this specific regionserver fail.
> What do you
> > recommended in this situation?
> >
> > Ferdy.
>



-- 
Mikael.S

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message