hbase-issues mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-2880) Hung cluster because master is hung because Get inside synchronize on RegionManager never returned
Date Tue, 27 Jul 2010 04:58:16 GMT

    [ https://issues.apache.org/jira/browse/HBASE-2880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892647#action_12892647 ]

stack commented on HBASE-2880:
------------------------------

I think I know what happened.

1. The master receives a split message from a random RS out on the cluster:

{code}
2010-07-23 23:25:49,144 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS:
TestTable,0085903910,1279952620402.a7e855e8d2c40ad6ac3d2e22b386681c.: Daughters; TestTable,0085903910,1279952729578.13a32aa514a77cbbbe3af50afc4595a1.,
TestTable,0089244873,1279952729578.a09a030cf17f38c88a8294e05e98c2ae. from sv2borg184,60020,1279951359490;
1 of 1
{code}

We process the above message.  Processing a region split, we synchronize on RegionManager.
Under the RegionManager lock, we go to assign the split daughters.  To be sure, before putting
a daughter on the assignment queue, we check that it hasn't already been assigned by the base
scanner (see HBASE-1784).  To check whether a region has already been assigned, we do
a Get on .META.
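
To make the shape of the problem concrete, below is a minimal sketch of that pattern, a remote Get against .META. issued while the RegionManager monitor is held.  It is not the actual ServerManager#processSplitRegion code; the regionManager, meta, and assignmentQueue fields are stand-ins invented for the illustration.

{code}
// Minimal sketch of the pattern described above, NOT the actual 0.89 code;
// regionManager, meta, and assignmentQueue are illustrative names only.
import java.io.IOException;
import java.util.Queue;

import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

class SplitProcessingSketch {
  private final Object regionManager;            // stands in for master.RegionManager
  private final HTable meta;                     // client handle on .META.
  private final Queue<HRegionInfo> assignmentQueue;

  SplitProcessingSketch(Object regionManager, HTable meta, Queue<HRegionInfo> assignmentQueue) {
    this.regionManager = regionManager;
    this.meta = meta;
    this.assignmentQueue = assignmentQueue;
  }

  void processSplitRegion(HRegionInfo daughterA, HRegionInfo daughterB) throws IOException {
    synchronized (regionManager) {                       // master-wide lock taken here
      for (HRegionInfo daughter : new HRegionInfo[] { daughterA, daughterB }) {
        // HBASE-1784 guard: look in .META. to see whether the base scanner
        // already assigned this daughter before we queue it ourselves.
        Get get = new Get(daughter.getRegionName());
        get.addColumn(Bytes.toBytes("info"), Bytes.toBytes("server"));
        Result r = meta.get(get);                        // remote RPC made while holding the lock;
                                                         // if the .META. host never answers, the
                                                         // lock is held forever and the master wedges
        if (r == null || r.isEmpty()) {
          assignmentQueue.add(daughter);                 // daughter still unassigned, queue it
        }
      }
    }
  }
}
{code}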

2. At the time of the master Get to .META., the RS hosting .META. OOMEs.

{code}
2010-07-23 23:25:43,286 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: IPC Server handler
9 on 60020 took 3835ms appending an edit to hlog; editcount=3
2010-07-23 23:25:45,541 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: IPC Server handler
18 on 60020 took 2255ms appending an edit to hlog; editcount=4
2010-07-23 23:25:59,431 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: Aborting
region server serverName=sv2borg187,60020,1279951359807, load=(requests=32, regions=29, usedHeap=959,
maxHeap=987): OutOfMemoryError, aborting
java.lang.OutOfMemoryError: Java heap space
        at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:39)
        at java.nio.ByteBuffer.allocate(ByteBuffer.java:312)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:829)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:423)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.run(HBaseServer.java:323)
2010-07-23 23:25:59,432 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
request=0.0, regions=29, stores=30, storefiles=52, storefileIndexSize=8, memstoreSize=505,
compactionQueueSize=6, usedHeap=976, maxHeap=987, blockCacheSize=160252032, blockCacheFree=46841728,
blockCacheCount=2396, blockCacheHitRatio=0
2010-07-23 23:25:59,432 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server listener on 60020:
exiting on OOME
{code}

This happens ~9 seconds after the master receives the split message.  I checked clocks on
the machines and they are aligned.

It looks like processing the daughter takes a second or so doing zk stuff.  Leading up to
an OOME, we're going to be doing a bunch of GCs, so the RS is not going to be making progress on
any work coming in.  Looking in the GC logs, indeed I see Full GC after Full GC, some taking multiple
seconds, and then there is the heap dump in the middle of it all, all of which takes time:


{code}
2010-07-23T23:25:48.068-0700: 1388.920: [Full GC 1388.920: [CMS2010-07-23T23:25:48.157-0700:
1389.010: [CMS-concurrent-mark: 0.149/0.151 secs] [Times: user=0.76 sys=0.03, real=0.15 secs]
 (concurrent mode failure): 895999K->895999K(896000K), 1.0361460 secs] 1011188K->980450K(1011200K),
[CMS Perm : 17666K->17666K(29588K)] icms_dc=100 , 1.0362380 secs] [Times: user=1.28 sys=0.01,
real=1.04 secs]
2010-07-23T23:25:49.105-0700: 1389.957: [GC [1 CMS-initial-mark: 895999K(896000K)] 980528K(1011200K),
0.0355820 secs] [Times: user=0.04 sys=0.00, real=0.03 secs]
2010-07-23T23:25:49.141-0700: 1389.993: [CMS-concurrent-mark-start]
2010-07-23T23:25:49.160-0700: 1390.012: [Full GC 1390.012: [CMS2010-07-23T23:25:49.281-0700:
1390.134: [CMS-concurrent-mark: 0.139/0.141 secs] [Times: user=0.62 sys=0.01, real=0.14 secs]
 (concurrent mode failure): 895999K->895999K(896000K), 1.1084360 secs] 1007166K->982021K(1011200K),
[CMS Perm : 17666K->17666K(29588K)] icms_dc=100 , 1.1085270 secs] [Times: user=1.46 sys=0.00,
real=1.11 secs]
2010-07-23T23:25:50.271-0700: 1391.123: [GC [1 CMS-initial-mark: 895999K(896000K)] 991433K(1011200K),
0.0327910 secs] [Times: user=0.04 sys=0.00, real=0.03 secs]
2010-07-23T23:25:50.304-0700: 1391.156: [CMS-concurrent-mark-start]
2010-07-23T23:25:50.309-0700: 1391.161: [Full GC 1391.161: [CMS2010-07-23T23:25:50.430-0700:
1391.282: [CMS-concurrent-mark: 0.125/0.126 secs] [Times: user=0.48 sys=0.02, real=0.13 secs]
 (concurrent mode failure): 895999K->895999K(896000K), 1.0769650 secs] 1001931K->998796K(1011200K),
[CMS Perm : 17666K->17666K(29588K)] icms_dc=100 , 1.0770800 secs] [Times: user=1.39 sys=0.03,
real=1.08 secs]
2010-07-23T23:25:51.386-0700: 1392.238: [Full GC 1392.238: [CMS[Unloading class sun.reflect.GeneratedMethodAccessor17]
[Unloading class sun.reflect.GeneratedMethodAccessor4]
[Unloading class sun.reflect.GeneratedMethodAccessor9]
[Unloading class sun.reflect.GeneratedMethodAccessor6]
[Unloading class sun.reflect.GeneratedMethodAccessor1]
[Unloading class sun.reflect.GeneratedMethodAccessor10]
[Unloading class sun.reflect.GeneratedMethodAccessor18]
[Unloading class sun.reflect.GeneratedConstructorAccessor22]
: 895999K->895999K(896000K), 1.0589560 secs] 998796K->998729K(1011200K), [CMS Perm :
17666K->17636K(29588K)] icms_dc=100 , 1.0590530 secs] [Times: user=1.06 sys=0.00, real=1.06
secs]
java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid1432.hprof ...
2010-07-23T23:25:52.457-0700: 1393.309: [GC [1 CMS-initial-mark: 895999K(896000K)] 998729K(1011200K),
0.0322630 secs] [Times: user=0.04 sys=0.00, real=0.03 secs]
2010-07-23T23:25:52.489-0700: 1393.342: [CMS-concurrent-mark-start]
Heap dump file created [1031291570 bytes in 6.973 secs]
2010-07-23T23:25:59.537-0700: 1400.389: [Full GC 1400.389: [CMS2010-07-23T23:25:59.556-0700:
1400.408: [CMS-concurrent-mark: 0.125/7.067 secs] [Times: user=3.85 sys=1.96, real=7.06 secs]
 (concurrent mode failure): 895999K->895999K(896000K), 0.9934440 secs] 1011146K->975171K(1011200K),
[CMS Perm : 17638K->17637K(29588K)] icms_dc=100 , 0.9935330 secs] [Times: user=1.04 sys=0.00,
real=0.99 secs]
{code}

So, the OOME sets the abort and stop flags and we see some threads exiting, but the weird thing
is that the HRS is still up now, days after the OOME.  The HRS did not go down.

3. Looking at the HRS thread dump, it's stuck trying to regionServerReport to the master.  It must
have gone into reporting to the master at about the time of the OOME.  We're stuck there, so we have not
noticed the shutdown flag, and so have not shut down the HRS.
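
Presumably the flag is only consulted between passes of the region server's main loop, so while the report RPC is stuck the check is never reached.  Here is a toy model of that structure (not the real HRegionServer#run; stopRequested and reportToMaster() are hypothetical stand-ins for the stop/abort flags and the regionServerReport RPC):

{code}
import java.util.concurrent.atomic.AtomicBoolean;

// Toy model of the region server main loop; not the real HRegionServer#run.
public class MainLoopSketch {
  private final AtomicBoolean stopRequested = new AtomicBoolean(false);
  private final Object responseNeverArrives = new Object();

  public void run() throws InterruptedException {
    while (!stopRequested.get()) {     // the only place the flag is consulted
      // In the real server this is an RPC to the master; HBaseClient.call()
      // parks in Object.wait() until a response arrives.  If it never does,
      // control never returns to the loop and the flag is never re-checked.
      reportToMaster();
    }
    // Orderly shutdown would happen here -- unreachable while the report is stuck.
  }

  // Stand-in for the blocking regionServerReport RPC.
  private void reportToMaster() throws InterruptedException {
    synchronized (responseNeverArrives) {
      responseNeverArrives.wait();     // simulates the call whose reply never comes
    }
  }

  // What the OOME handler effectively does: flip the flag and hope the loop notices.
  public void stop() {
    stopRequested.set(true);
  }
}
{code}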

Here is an excerpt from the master thread dump:

{code}
"IPC Server handler 9 on 60000" daemon prio=10 tid=0x00007fe1dc00f000 nid=0x409d in Object.wait()
[0x00007fe1e9200000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:485)
        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:732)
        - locked <0x00007fe1f8672818> (a org.apache.hadoop.hbase.ipc.HBaseClient$Call)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:252)
        at $Proxy1.get(Unknown Source)
        at org.apache.hadoop.hbase.master.ServerManager.assignSplitDaughter(ServerManager.java:550)
        at org.apache.hadoop.hbase.master.ServerManager.processSplitRegion(ServerManager.java:525)
        - locked <0x00007fe1f94777d0> (a org.apache.hadoop.hbase.master.RegionManager)
        at org.apache.hadoop.hbase.master.ServerManager.processMsgs(ServerManager.java:476)
        at org.apache.hadoop.hbase.master.ServerManager.processRegionServerAllsWell(ServerManager.java:425)
        at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:335)
        at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:738)
        at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:576)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:919)
        
"IPC Server handler 8 on 60000" daemon prio=10 tid=0x00007fe1dc00d000 nid=0x409c waiting for
monitor entry [0x00007fe1e9301000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.hadoop.hbase.master.ServerManager.processMsgs(ServerManager.java:489)
        - waiting to lock <0x00007fe1f94777d0> (a org.apache.hadoop.hbase.master.RegionManager)
        at org.apache.hadoop.hbase.master.ServerManager.processRegionServerAllsWell(ServerManager.java:425)
        at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:335)
        at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:738)
        at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:576) 
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:919)
{code}

Two threads are in processRegionServerAllsWell.  One is processing the daughter split.  The other
is waiting to lock RegionManager so it can complete its regionServerReport.
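
For reference, the two states in that dump (one thread WAITING inside the lock, the other BLOCKED at its edge) are easy to reproduce in isolation; in this toy the CountDownLatch simply stands in for the Get whose reply never arrives:

{code}
import java.util.concurrent.CountDownLatch;

// Tiny reproduction of the two thread states in the dump above.
public class StuckHandlersDemo {
  public static void main(String[] args) throws Exception {
    final Object regionManager = new Object();
    final CountDownLatch replyThatNeverComes = new CountDownLatch(1);

    Thread handler9 = new Thread(() -> {
      synchronized (regionManager) {           // processSplitRegion takes the lock...
        try {
          replyThatNeverComes.await();         // ...then blocks forever inside it
        } catch (InterruptedException ignored) { }
      }
    }, "IPC Server handler 9");

    Thread handler8 = new Thread(() -> {
      synchronized (regionManager) {           // processMsgs wants the same lock
        // never reached
      }
    }, "IPC Server handler 8");

    handler9.setDaemon(true);
    handler8.setDaemon(true);
    handler9.start();
    Thread.sleep(100);                         // let handler 9 win the monitor
    handler8.start();
    Thread.sleep(100);
    System.out.println(handler9.getName() + ": " + handler9.getState());  // WAITING
    System.out.println(handler8.getName() + ": " + handler8.getState());  // BLOCKED
  }
}
{code}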

I'm not clear yet on why the HBaseServer is still up even though it OOME'd.  I can see the Handlers
all still up and running in the bad HRS.

> Hung cluster because master is hung because Get inside synchronize on RegionManager never
returned
> --------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-2880
>                 URL: https://issues.apache.org/jira/browse/HBASE-2880
>             Project: HBase
>          Issue Type: Bug
>            Reporter: stack
>            Priority: Critical
>             Fix For: 0.90.0
>
>
> I just ran into this testing 0.89 RC candidate.
> So, Master is hung up because all threads are locked out because one thread is stuck
inside a block that is synchronized on RegionManager (0x00007fe1f94777d0 in the below):
> {code}
> "IPC Server handler 9 on 60000" daemon prio=10 tid=0x00007fe1dc00f000 nid=0x409d in Object.wait() [0x00007fe1e9200000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Object.wait(Object.java:485)
>         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:732)
>         - locked <0x00007fe1f8672818> (a org.apache.hadoop.hbase.ipc.HBaseClient$Call)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:252)
>         at $Proxy1.get(Unknown Source)
>         at org.apache.hadoop.hbase.master.ServerManager.assignSplitDaughter(ServerManager.java:550)
>         at org.apache.hadoop.hbase.master.ServerManager.processSplitRegion(ServerManager.java:525)
>         - locked <0x00007fe1f94777d0> (a org.apache.hadoop.hbase.master.RegionManager)
>         at org.apache.hadoop.hbase.master.ServerManager.processMsgs(ServerManager.java:476)
>         at org.apache.hadoop.hbase.master.ServerManager.processRegionServerAllsWell(ServerManager.java:425)
>         at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:335)
>         at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:738)
>         at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:576)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:919)
> {code}
> The above code is not returning because Call#callComplete is never going to be called
on the outstanding Get.  The target RS OOME'd.  Something in the way an OOME is being processed
made it so this connection is not ever going to be cleaned up/notified.
> We're stuck here.
> I'm trying to figure out why the cleanup is not happening.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

