hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "stack (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-17072) CPU usage starts to climb up to 90-100% when using G1GC
Date Tue, 22 Nov 2016 22:43:59 GMT

    [ https://issues.apache.org/jira/browse/HBASE-17072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15688165#comment-15688165

stack commented on HBASE-17072:

Ok. Looks like I had my test counts wrong way around. I see that the current master branch
does about 1/3rd as many more seeks as the patch does.

Here is the seek that current master does that does not happen with the patch (I have seeks
throwing a RuntimeException in a LOG). It happens 143 times in a small YCSB run that spans
loading and workload c:

26262 2016-11-22 13:40:37,479 INFO  [regionserver/ve0528.halxg.cloudera.com/]
hfile.HFileBlock: COUNT=583
26263 java.lang.RuntimeException
26264   at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1710)
26265   at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1564)
26266   at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.readBlock(HFileReaderImpl.java:1484)
26267   at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.readAndUpdateNewBlock(HFileReaderImpl.java:1124)
26268   at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:1112)
26269   at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:286)
26270   at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:182)
26271   at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:386)
26272   at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:294)
26273   at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:250)
26274   at org.apache.hadoop.hbase.regionserver.compactions.Compactor.createScanner(Compactor.java:515)
26275   at org.apache.hadoop.hbase.regionserver.compactions.Compactor$1.createScanner(Compactor.java:241)
26276   at org.apache.hadoop.hbase.regionserver.compactions.Compactor.compact(Compactor.java:300)
26277   at org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:66)
26278   at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:126)
26279   at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1268)
26280   at org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:2011)
26281   at org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.doCompaction(CompactSplitThread.java:526)
26282   at org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.run(CompactSplitThread.java:565)
26283   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
26284   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
26285   at java.lang.Thread.run(Thread.java:745)

The seek that happens with and without patch is the below one which is done on initial open
of the file post compact or on flush or on open of a region:

  613 2016-11-22 13:29:34,730 INFO  [StoreFileOpenerThread-info-1] hfile.HFileBlock: COUNT=1
  614 java.lang.RuntimeException
  615   at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1710)
  616   at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1564)
  617   at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl$2.nextBlock(HFileBlock.java:1464)
  618   at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl$2.nextBlockWithBlockType(HFileBlock.java:1472)
  619   at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.<init>(HFileReaderImpl.java:209)
  620   at org.apache.hadoop.hbase.io.hfile.HFile.pickReaderVersion(HFile.java:506)
  621   at org.apache.hadoop.hbase.io.hfile.HFile.createReader(HFile.java:545)
  622   at org.apache.hadoop.hbase.regionserver.StoreFileReader.<init>(StoreFileReader.java:94)
  623   at org.apache.hadoop.hbase.regionserver.StoreFileInfo.open(StoreFileInfo.java:270)
  624   at org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:418)
  625   at org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:530)
  626   at org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:520)
  627   at org.apache.hadoop.hbase.regionserver.HStore.createStoreFileAndReader(HStore.java:646)
  628   at org.apache.hadoop.hbase.regionserver.HStore.access$000(HStore.java:113)
  629   at org.apache.hadoop.hbase.regionserver.HStore$1.call(HStore.java:513)
  630   at org.apache.hadoop.hbase.regionserver.HStore$1.call(HStore.java:510)
  631   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  632   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
  633   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  634   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  635   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  636   at java.lang.Thread.run(Thread.java:745)

... even still... unpatched, we do more of the above than w/ patch.

Let me update the patch w/ some better commentary since it is tough digging in on this stuff.

> CPU usage starts to climb up to 90-100% when using G1GC
> -------------------------------------------------------
>                 Key: HBASE-17072
>                 URL: https://issues.apache.org/jira/browse/HBASE-17072
>             Project: HBase
>          Issue Type: Bug
>          Components: Performance, regionserver
>    Affects Versions: 1.0.0, 2.0.0, 1.2.0
>            Reporter: Eiichi Sato
>         Attachments: HBASE-17072.master.001.patch, disable-block-header-cache.patch,
mat-threadlocals.png, mat-threads.png, metrics.png, slave1.svg, slave2.svg, slave3.svg, slave4.svg
> h5. Problem
> CPU usage of a region server in our CDH 5.4.5 cluster, at some point, starts to gradually
get higher up to nearly 90-100% when using G1GC.  We've also run into this problem on CDH
5.7.3 and CDH 5.8.2.
> In our production cluster, it normally takes a few weeks for this to happen after restarting
a RS.  We reproduced this on our test cluster and attached the results.  Please note that,
to make it easy to reproduce, we did some "anti-tuning" on a table when running tests.
> In metrics.png, soon after we started running some workloads against a test cluster (CDH
5.8.2) at about 7 p.m. CPU usage of the two RSs started to rise.  Flame Graphs (slave1.svg
to slave4.svg) are generated from jstack dumps of each RS process around 10:30 a.m. the next
> After investigating heapdumps from another occurrence on a test cluster running CDH 5.7.3,
we found that the ThreadLocalMap contain a lot of contiguous entries of {{HFileBlock$PrefetchedHeader}}
probably due to primary clustering.  This caused more loops in {{ThreadLocalMap#expungeStaleEntries()}},
consuming a certain amount of CPU time.  What is worse is that the method is called from RPC
metrics code, which means even a small amount of per-RPC time soon adds up to a huge amount
of CPU time.
> This is very similar to the issue in HBASE-16616, but we have many {{HFileBlock$PrefetchedHeader}}
not only {{Counter$IndexHolder}} instances.  Here are some OQL counts from Eclipse Memory
Analyzer (MAT).  This shows a number of ThreadLocal instances in the ThreadLocalMap of a single
handler thread.
> {code}
> 			  FROM OBJECTS 0x4ee380430) obj
> WHERE obj.@clazz.@name = "org.apache.hadoop.hbase.io.hfile.HFileBlock$PrefetchedHeader"
> #=> 10980 instances
> {code}
> {code}
> 			  FROM OBJECTS 0x4ee380430) obj
> WHERE obj.@clazz.@name = "org.apache.hadoop.hbase.util.Counter$IndexHolder"
> #=> 2052 instances
> {code}
> Although as described in HBASE-16616 this somewhat seems to be an issue in G1GC side
regarding weakly-reachable objects, we should keep ThreadLocal usage minimal and avoid creating
an indefinite number (in this case, a number of HFiles) of ThreadLocal instances.
> HBASE-16146 removes ThreadLocals from the RPC metrics code.  That may solve the issue
(I just saw the patch, never tested it at all), but the {{HFileBlock$PrefetchedHeader}} are
still there in the ThreadLocalMap, which may cause issues in the future again.
> h5. Our Solution
> We simply removed the whole {{HFileBlock$PrefetchedHeader}} caching and fortunately we
didn't notice any performance degradation for our production workloads.
> Because the PrefetchedHeader caching uses ThreadLocal and because RPCs are handled randomly
in any of the handlers, small Get or small Scan RPCs do not benefit from the caching (See
HBASE-10676 and HBASE-11402 for the details).  Probably, we need to see how well reads are
saved by the caching for large Scan or Get RPCs and especially for compactions if we really
remove the caching. It's probably better if we can remove ThreadLocals without breaking the
current caching behavior.
> FWIW, I'm attaching the patch we applied. It's for CDH 5.4.5.

This message was sent by Atlassian JIRA

View raw message