hbase-issues mailing list archives

From "Andrey Stepachev (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-11460) Deadlock in HMaster on masterAndZKLock in HConnectionManager
Date Fri, 04 Jul 2014 12:40:34 GMT

    [ https://issues.apache.org/jira/browse/HBASE-11460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052410#comment-14052410 ]

Andrey Stepachev commented on HBASE-11460:
------------------------------------------

Thank you, Ted. It looks like this fixes the issue. Great work.

> Deadlock in HMaster on masterAndZKLock in HConnectionManager
> ------------------------------------------------------------
>
>                 Key: HBASE-11460
>                 URL: https://issues.apache.org/jira/browse/HBASE-11460
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.96.0
>            Reporter: Andrey Stepachev
>            Assignee: Ted Yu
>            Priority: Critical
>             Fix For: 0.99.0
>
>         Attachments: 11460-v1.txt, threads.tdump
>
>
> On one of our clusters we got a deadlock in HMaster.
> In a nutshell, the deadlock is caused by using one HConnectionManager both for serving client-like calls and for calls from HMaster RPC handlers.
> HBaseAdmin uses HConnectionManager, which takes the lock masterAndZKLock.
> On the other side sits TableNamespaceManager (TNM). This class uses HConnectionManager too (in my case, to get the list of available namespaces).
> The problem is that the HMaster class uses TNM to serve RPC requests.
> If we look at TNM more closely, we can see that this class is fully synchronized.
> That gives us a problem.
> The web interface issues a request via HConnectionManager and takes masterAndZKLock.
> The connection is blocking, so RpcClient waits for the reply while holding the lock.
> This is how it looks in the thread dump:
> {code}
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
> 	at java.lang.Object.wait(Native Method)
> 	- waiting on <0x00000000c8905430> (a org.apache.hadoop.hbase.ipc.RpcClient$Call)
> 	at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1435)
> 	- locked <0x00000000c8905430> (a org.apache.hadoop.hbase.ipc.RpcClient$Call)
> 	at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1653)
> 	at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1711)
> 	at org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$BlockingStub.isMasterRunning(MasterProtos.java:40216)
> 	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$MasterServiceState.isMasterRunning(HConnectionManager.java:1467)
> 	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.isKeepAliveMasterConnectedAndRunning(HConnectionManager.java:2093)
> 	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getKeepAliveMasterService(HConnectionManager.java:1819)
> 	- locked <0x00000000d15dc668> (a java.lang.Object)
> 	at org.apache.hadoop.hbase.client.HBaseAdmin$MasterCallable.prepare(HBaseAdmin.java:3187)
> 	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:119)
> 	- locked <0x00000000cd0c1238> (a org.apache.hadoop.hbase.client.RpcRetryingCaller)
> 	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:96)
> 	- locked <0x00000000cd0c1238> (a org.apache.hadoop.hbase.client.RpcRetryingCaller)
> 	at org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3214)
> 	at org.apache.hadoop.hbase.client.HBaseAdmin.listTableDescriptorsByNamespace(HBaseAdmin.java:2265)
> {code}
> Some other client calls any HMaster RPC, which calls TableNamespaceManager methods, which in turn block on the global HConnectionManager lock masterAndZKLock.
> This is how it looks:
> {code}
>   java.lang.Thread.State: BLOCKED (on object monitor)
> 	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getKeepAliveZooKeeperWatcher(HConnectionManager.java:1699)
> 	- waiting to lock <0x00000000d15dc668> (a java.lang.Object)
> 	at org.apache.hadoop.hbase.client.ZooKeeperRegistry.isTableOnlineState(ZooKeeperRegistry.java:100)
> 	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.isTableDisabled(HConnectionManager.java:874)
> 	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:1027)
> 	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:852)
> 	at org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:72)
> 	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:119)
> 	- locked <0x00000000cd0ef108> (a org.apache.hadoop.hbase.client.RpcRetryingCaller)
> 	at org.apache.hadoop.hbase.client.HTable.getRowOrBefore(HTable.java:705)
> 	at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:144)
> 	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.prefetchRegionCache(HConnectionManager.java:1102)
> 	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1162)
> 	- locked <0x00000000d1b49fd8> (a java.lang.Object)
> 	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1054)
> 	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1011)
> 	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:852)
> 	at org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:72)
> 	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:119)
> 	- locked <0x00000000cd0ef248> (a org.apache.hadoop.hbase.client.RpcRetryingCaller)
> 	at org.apache.hadoop.hbase.client.HTable.get(HTable.java:756)
> 	at org.apache.hadoop.hbase.master.TableNamespaceManager.get(TableNamespaceManager.java:134)
> 	at org.apache.hadoop.hbase.master.TableNamespaceManager.get(TableNamespaceManager.java:118)
> 	- locked <0x00000000d189da20> (a org.apache.hadoop.hbase.master.TableNamespaceManager)
> 	at org.apache.hadoop.hbase.master.HMaster.getNamespaceDescriptor(HMaster.java:3113)
> 	at org.apache.hadoop.hbase.master.HMaster.listTableDescriptorsByNamespace(HMaster.java:3133)
> 	at org.apache.hadoop.hbase.master.HMaster.listTableDescriptorsByNamespace(HMaster.java:3034)
> 	at org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java:38261)
> 	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2175)
> 	at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879)
> {code}
> Finally, the original handler, which should serve the request from the web GUI, can be blocked on TNM methods, effectively forming a deadlock.
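
The cycle described above can be reproduced in isolation. What follows is only a minimal sketch under stated assumptions, not HBase code: the class DeadlockSketch, the method stuckThreads and the three threads are hypothetical stand-ins, both locks are modelled with ReentrantLock (the real TNM uses synchronized methods), and the RPC reply is modelled with a latch. Timeouts are used so that the program terminates and can report how many participants were stuck.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of the reported cycle; none of these names exist in HBase.
class DeadlockSketch {

    /** Returns how many of the three threads failed to make progress within timeoutMs. */
    static int stuckThreads(long timeoutMs) {
        ReentrantLock masterAndZKLock = new ReentrantLock(); // stand-in for the connection-wide lock
        ReentrantLock tnmMonitor = new ReentrantLock();      // stand-in for the synchronized TNM
        CountDownLatch locksHeld = new CountDownLatch(2);    // both locks have been taken
        CountDownLatch reply = new CountDownLatch(1);        // RPC reply for the web UI call
        CountDownLatch finished = new CountDownLatch(3);     // every thread has recorded its outcome
        AtomicInteger stuck = new AtomicInteger();

        // Web UI call: takes masterAndZKLock, then blocks on the RPC reply while holding it.
        Thread webUi = new Thread(() -> {
            masterAndZKLock.lock();
            try {
                locksHeld.countDown();
                if (!reply.await(timeoutMs, TimeUnit.MILLISECONDS)) stuck.incrementAndGet();
                finished.countDown();
                finished.await(); // keep the lock until all outcomes are recorded
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                masterAndZKLock.unlock();
            }
        });

        // RPC handler 1: inside a TNM method, then needs masterAndZKLock.
        Thread handler1 = new Thread(() -> {
            tnmMonitor.lock();
            try {
                locksHeld.countDown();
                locksHeld.await();
                if (masterAndZKLock.tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
                    masterAndZKLock.unlock();
                } else {
                    stuck.incrementAndGet();
                }
                finished.countDown();
                finished.await(); // keep the monitor until all outcomes are recorded
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                tnmMonitor.unlock();
            }
        });

        // RPC handler 2: would answer the web UI call, but needs a TNM method first.
        Thread handler2 = new Thread(() -> {
            try {
                locksHeld.await();
                if (tnmMonitor.tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
                    try { reply.countDown(); } finally { tnmMonitor.unlock(); }
                } else {
                    stuck.incrementAndGet();
                }
                finished.countDown();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        webUi.start(); handler1.start(); handler2.start();
        try {
            webUi.join(); handler1.join(); handler2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the sketch threads", e);
        }
        return stuck.get();
    }

    public static void main(String[] args) {
        System.out.println("stuck threads: " + stuckThreads(200));
    }
}
```

In this model all three threads time out, because each one waits on something that another one keeps until the end. Without the timeouts, as in the thread dumps above, none of them would ever proceed.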



--
This message was sent by Atlassian JIRA
(v6.2#6252)