hbase-issues mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-2880) Hung cluster because master is hung because Get inside synchronize on RegionManager never returned
Date Tue, 27 Jul 2010 05:18:16 GMT

    [ https://issues.apache.org/jira/browse/HBASE-2880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892651#action_12892651 ]

stack commented on HBASE-2880:
------------------------------

So, if the race condition were removed -- i.e., the race between the basescanner and the master assigning daughters on receipt of the split message -- then this particular lockup would not have happened.

TODO: Review all of the regionserver locks to see if we're doing scans or gets while the lock is up.
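To make the hazard in the trace below concrete, here is a minimal, self-contained Java sketch (all names here are hypothetical, not from the HBase code base): a thread takes one monitor and then blocks in Object.wait() on a second object. wait() releases only the second object's monitor, not the first, so every other thread queuing on the first lock hangs until the waiter is notified -- or, if the wait is bounded, until the timeout fires. A bounded wait is one assumed way to keep a lost notification from wedging everything.

```java
// Hypothetical demo: 'regionLock' stands in for the RegionManager monitor and
// 'call' for the HBaseClient$Call the handler waits on. Nothing ever notifies
// 'call' (as when the remote side dies), so only the timed wait lets the
// handler -- and everyone queued on regionLock -- make progress again.
public class LockWhileWaitingDemo {
    private static final Object regionLock = new Object();
    private static final Object call = new Object();

    public static void main(String[] args) throws InterruptedException {
        Thread handler = new Thread(() -> {
            synchronized (regionLock) {      // outer lock, held across the "RPC"
                synchronized (call) {
                    try {
                        // wait() releases 'call' but NOT 'regionLock';
                        // the 1s bound substitutes for the missing notify.
                        call.wait(1000);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            }
        });
        handler.start();
        Thread.sleep(150); // let the handler reach wait()

        // A second thread trying to take regionLock blocks meanwhile.
        Thread probe = new Thread(() -> { synchronized (regionLock) { } });
        probe.start();
        probe.join(300);
        System.out.println("probe blocked while handler waited: " + probe.isAlive());

        handler.join();
        probe.join();
        System.out.println("both finished after timed wait: "
                + (!handler.isAlive() && !probe.isAlive()));
    }
}
```

With an unbounded wait() and no notify, the probe thread (and in the real trace, every other IPC handler) would block forever.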

> Hung cluster because master is hung because Get inside synchronize on RegionManager never returned
> --------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-2880
>                 URL: https://issues.apache.org/jira/browse/HBASE-2880
>             Project: HBase
>          Issue Type: Bug
>            Reporter: stack
>            Priority: Critical
>             Fix For: 0.90.0
>
>
> I just ran into this testing 0.89 RC candidate.
> So, Master is hung up because all threads are locked out because one thread is stuck inside a block that is synchronized on RegionManager (0x00007fe1f94777d0 in the below):
> {code}
> "IPC Server handler 9 on 60000" daemon prio=10 tid=0x00007fe1dc00f000 nid=0x409d in Object.wait() [0x00007fe1e9200000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Object.wait(Object.java:485)
>         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:732)
>         - locked <0x00007fe1f8672818> (a org.apache.hadoop.hbase.ipc.HBaseClient$Call)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:252)
>         at $Proxy1.get(Unknown Source)
>         at org.apache.hadoop.hbase.master.ServerManager.assignSplitDaughter(ServerManager.java:550)
>         at org.apache.hadoop.hbase.master.ServerManager.processSplitRegion(ServerManager.java:525)
>         - locked <0x00007fe1f94777d0> (a org.apache.hadoop.hbase.master.RegionManager)
>         at org.apache.hadoop.hbase.master.ServerManager.processMsgs(ServerManager.java:476)
>         at org.apache.hadoop.hbase.master.ServerManager.processRegionServerAllsWell(ServerManager.java:425)
>         at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:335)
>         at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:738)
>         at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:576)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:919)
> {code}
> The above call is not returning because Call#callComplete is never going to be called on the outstanding Get.  The target RS OOME'd.  Something in the way the OOME was processed made it so this connection is never going to be cleaned up/notified.
> We're stuck here.
> I'm trying to figure out why the cleanup is not happening.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

