hbase-user mailing list archives

From Ferdy Galema <ferdy.gal...@kalooga.com>
Subject Re: gc pause killing regionserver
Date Tue, 06 Mar 2012 10:37:52 GMT
Thanks for the replies. That is a lot of useful information. I admit that
I'm still running a bit behind when it comes to truly understanding GC
mechanics and how HBase deals with them. I also find the following resource
useful for understanding GC:
https://java.sun.com/j2se/reference/whitepapers/memorymanagement_whitepaper.pdf

> Why use the following in your config?
I use these GC tuning options because I found them somewhere on the mailing
list, advertised as generally recommended GC options. I think it would be
nice if the HBase ref guide recommended default GC settings; I can imagine
they differ for different heap sizes.
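
For illustration, a starting point for a heap around this size might look
something like the following. These are just values I've seen cited on the
list, not an official recommendation, so tune them against your own GC logs:

-Xms2g -Xmx2g
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC
-XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/path/to/gc.log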

> It's possible that you actually are in swap due to uncollected off-heap
> memory allocations.  I doubt that even severe fragmentation on the heap
> would cause that kind of slowdown
Munin does show some minor swapping (and memory overcommit), but
considering the amount of free space left (OS disk cache) and the fact that
swappiness is set to zero, I was under the impression that it was harmless.
On second thought, I will dig deeper into this.
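
For what it's worth, these are the standard Linux checks I plan to start
with (nothing HBase-specific here):

vmstat 5                     # si/so columns show swap-in/swap-out activity
cat /proc/sys/vm/swappiness  # confirm the swappiness setting
grep -i swap /proc/meminfo   # SwapTotal/SwapFree/SwapCached system-wide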

> If I'm reading your log correctly, you have about 2.5 GB of heap, right?
That's right, 2100100100 bytes to be exact.

> Does this server have the same load as the other ones?
Yes. They all serve about the same number of regions and generally have the
same load. The hardware is (should be) identical.

Ferdy.


On Mon, Mar 5, 2012 at 10:38 PM, Mikael Sitruk <mikael.sitruk@gmail.com> wrote:

> Try setting the initiating occupancy to a lower value, and try allocating
> more space to the new generation.
> It seems that you don't have enough room in the survivor space, and
> therefore you get a promotion failure.
>
> You also mention that the same server is frequently having this problem; is
> it because of your data spread? Does this server have the same load as the
> other ones?
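>
> For example, just a sketch (the exact values depend on your heap size and
> allocation rate, so treat these as placeholders):
>
> -Xmn256m -XX:SurvivorRatio=4
> -XX:CMSInitiatingOccupancyFraction=60 -XX:+UseCMSInitiatingOccupancyOnly
>
> -Xmn fixes the new generation size, a lower SurvivorRatio makes the
> survivor spaces larger, and a lower initiating occupancy starts CMS earlier.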
>
>
>
> On Mon, Mar 5, 2012 at 11:00 PM, Sandy Pratt <prattrs@adobe.com> wrote:
>
> > What was the actual process size of the JVM as reported by top?
> >
> > Why use the following in your config?
> >
> > -XX:NewRatio=16
> > -XX:MaxGCPauseMillis=100
> >
> > Do you really have a stringent latency target, or are you just being
> > aggressive?
> >
> > If I'm reading your log correctly, you have about 2.5 GB of heap, right?
> >  If so, your full GC times are too long by over an order of magnitude.
> >  Here is a snippet from one of my servers for comparison:
> >
> > 2393278.112: [GC [PSYoungGen: 453884K->221141K(466048K)] 1716800K->1510206K(1864192K), 0.1394090 secs] [Times: user=0.55 sys=0.00, real=0.14 secs]
> > 2393281.457: [GC [PSYoungGen: 454229K->181061K(466048K)] 1743294K->1529614K(1864192K), 0.1377460 secs] [Times: user=0.54 sys=0.00, real=0.13 secs]
> > 2393281.595: [Full GC [PSYoungGen: 181061K->0K(466048K)] [ParOldGen: 1348552K->985640K(1398144K)] 1529614K->985640K(1864192K) [PSPermGen: 21228K->21225K(21504K)], 0.6138910 secs] [Times: user=2.16 sys=0.02, real=0.62 secs]
> > 2393285.890: [GC [PSYoungGen: 233088K->33764K(466048K)] 1218728K->1019404K(1864192K), 0.0228820 secs] [Times: user=0.09 sys=0.00, real=0.02 secs]
> > ...
> > 2393323.370: [GC [PSYoungGen: 272556K->32976K(466048K)] 1581838K->1381385K(1864192K), 0.0444630 secs] [Times: user=0.18 sys=0.00, real=0.04 secs]
> > 2393323.415: [Full GC [PSYoungGen: 32976K->0K(466048K)] [ParOldGen: 1348409K->975392K(1398144K)] 1381385K->975392K(1864192K) [PSPermGen: 21225K->21225K(21504K)], 0.5755400 secs] [Times: user=2.08 sys=0.00, real=0.58 secs]
> > 2393326.708: [GC [PSYoungGen: 233088K->32236K(470528K)] 1208480K->1007628K(1868672K), 0.0222720 secs] [Times: user=0.08 sys=0.01, real=0.02 secs]
> > 2393329.372: [GC [PSYoungGen: 269804K->30186K(466048K)] 1245196K->1037366K(1864192K), 0.0377520 secs] [Times: user=0.14 sys=0.00, real=0.04 secs]
> >
> > Granted, your heap is a little larger, but not that much larger.
> >
> > ~600 ms might be too long to wait for you, so you might not want to use
> > the parallel compacting collector.  Still, it seems that even when you fall
> > back to the serial collector during concurrent mode failure, your pauses
> > should be more in the 3-5 second range (just guessing -- see user times
> > above).
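> >
> > (For reference, the log above comes from the throughput collector; if you
> > want to try it, it should be something like -XX:+UseParallelGC
> > -XX:+UseParallelOldGC on a recent Sun JVM.)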
> >
> > It's possible that you actually are in swap due to uncollected off-heap
> > memory allocations.  I doubt that even severe fragmentation on the heap
> > would cause that kind of slowdown.  Remember, the JVM memory footprint is
> > not only Java heap.  In fact, I have recently seen situations with both
> > JBoss and HBase where heap size is a small fraction of JVM process size.
> >  This seems to occur most often in lightly loaded installations.  It might
> > be interesting to have a look around your processes with pmap or some tool
> > like that.  In many situations, you'll probably be able to account for each
> > Java heap generation allocation easily enough, then be able to identify a
> > succession of other allocations that aren't part of the Java heap.  Those
> > latter parts are likely to be your problem, if my guess is on the money.
> >
> > There's a little info about using pmap against the JVM here:
> >
> > http://www.kdgregory.com/index.php?page=java.outOfMemory
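> >
> > For example, something along these lines (substitute your regionserver's
> > pid; the third column of pmap -x output is resident size in KB):
> >
> > pmap -x <pid> | sort -k3 -n | tail -20
> >
> > Large anonymous mappings that don't line up with your -Xmx/-Xmn sizes are
> > the candidates for off-heap usage.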
> >
> >
> >
> >
> >
> > > -----Original Message-----
> > > From: Ferdy Galema [mailto:ferdy.galema@kalooga.com]
> > > Sent: Saturday, March 03, 2012 05:05
> > > To: user@hbase.apache.org
> > > Subject: gc pause killing regionserver
> > >
> > > Hi,
> > >
> > > I'm running regionservers with a 2GB heap and the following tuning options:
> > > -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:NewRatio=16
> > > -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly
> > > -XX:MaxGCPauseMillis=100
> > >
> > > A regionserver aborted (YouAreDeadException) and this was printed in the
> > > gc logs (everything is shown up until the abort):
> > >
> > > 211663.516: [GC 211663.516: [ParNew: 118715K->13184K(118912K), 0.0445390 secs] 1373940K->1289814K(2233472K), 0.0446420 secs] [Times: user=0.14 sys=0.01, real=0.05 secs]
> > > 211663.686: [GC 211663.686: [ParNew: 118912K->13184K(118912K), 0.0594280 secs] 1395542K->1310185K(2233472K), 0.0595420 secs] [Times: user=0.15 sys=0.00, real=0.06 secs]
> > > 211663.869: [GC 211663.869: [ParNew: 118790K->13184K(118912K), 0.0434820 secs] 1415792K->1331317K(2233472K), 0.0435930 secs] [Times: user=0.13 sys=0.01, real=0.04 secs]
> > > 211667.598: [GC 211667.598: [ParNew (promotion failed): 118912K->118912K(118912K), 0.0225390 secs]211667.621: [CMS: 1330845K->1127914K(2114560K), 51.3610670 secs] 1437045K->1127914K(2233472K), [CMS Perm : 20680K->20622K(34504K)], 51.3838170 secs] [Times: user=1.82 sys=0.31, real=51.38 secs]
> > > 211719.713: [GC 211719.714: [ParNew: 105723K->13184K(118912K), 0.0176130 secs] 1233638K->1149393K(2233472K), 0.0177230 secs] [Times: user=0.07 sys=0.00, real=0.02 secs]
> > > 211719.851: [GC 211719.852: [ParNew: 118912K->13184K(118912K), 0.0281860 secs] 1255121K->1170269K(2233472K), 0.0282970 secs] [Times: user=0.10 sys=0.01, real=0.03 secs]
> > > 211719.993: [GC 211719.993: [ParNew: 118795K->13184K(118912K), 0.0276320 secs] 1275880K->1191268K(2233472K), 0.0277350 secs] [Times: user=0.09 sys=0.00, real=0.03 secs]
> > > 211720.490: [GC 211720.490: [ParNew: 118912K->13184K(118912K), 0.0624650 secs] 1296996K->1210640K(2233472K), 0.0625560 secs] [Times: user=0.15 sys=0.00, real=0.06 secs]
> > > 211720.687: [GC 211720.687: [ParNew: 118702K->13184K(118912K), 0.1651750 secs] 1316159K->1231993K(2233472K), 0.1652660 secs] [Times: user=0.25 sys=0.01, real=0.17 secs]
> > > 211721.038: [GC 211721.038: [ParNew: 118912K->13184K(118912K), 0.0952750 secs] 1337721K->1252598K(2233472K), 0.0953660 secs] [Times: user=0.15 sys=0.00, real=0.09 secs]
> > > Heap
> > >  par new generation  total 118912K, used 86199K [0x00002aaaae1f0000, 0x00002aaab62f0000, 0x00002aaab62f0000)
> > >   eden space 105728K,  69% used [0x00002aaaae1f0000, 0x00002aaab293dfa8, 0x00002aaab4930000)
> > >   from space 13184K, 100% used [0x00002aaab4930000, 0x00002aaab5610000, 0x00002aaab5610000)
> > >   to   space 13184K,   0% used [0x00002aaab5610000, 0x00002aaab5610000, 0x00002aaab62f0000)
> > >  concurrent mark-sweep generation total 2114560K, used 1239414K [0x00002aaab62f0000, 0x00002aab373f0000, 0x00002aab373f0000)
> > >  concurrent-mark-sweep perm gen total 34504K, used 20728K [0x00002aab373f0000, 0x00002aab395a2000, 0x00002aab3c7f0000)
> > >
> > >
> > > Why did a GC take 51 seconds? The machine still had enough memory
> > > available, so it could not have been swapping (swappiness is set to 0). Of
> > > the 15 regionservers in total, I often see this specific regionserver fail.
> > > What do you recommend in this situation?
> > >
> > > Ferdy.
> >
>
>
>
> --
> Mikael.S
>
