hbase-user mailing list archives

From Varun Sharma <va...@pinterest.com>
Subject Region server deadlocks in master master replication
Date Fri, 30 Nov 2012 11:46:50 GMT
Hi,

I have a master-master replication setup with HBase 0.94.0. I only write
to cluster A, and replication carries the data over to cluster B. I am
having some really weird issues with cluster B. Basically, on each region
server, all the priority RPC handlers are stuck in calls to
replicateLogEntries while all the normal RPC handlers are just waiting.

From the logs I could see the following:

1) Region server shutdown
Stopping the region server showed some issues. Exceptions were thrown
while closing regions; they came from the locateRegionInMeta calls and
from attempts to read the value of /hbase/root-region-server (I have
checked with a manual client, and ZooKeeper is working fine).

2) jstack traces show that there are issues with locating the META and the
ROOT tables

"PRI IPC Server handler 2 on 60020" daemon prio=10 tid=0x00007f4ddcd39000 nid=0x2dbf waiting on condition [0x00007f4dd9edc000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1046)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:836)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:801)
        at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:234)
        at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:174)
        at org.apache.hadoop.hbase.client.HTableFactory.createHTableInterface(HTableFactory.java:36)
        at org.apache.hadoop.hbase.client.HTablePool.createHTable(HTablePool.java:268)
        at org.apache.hadoop.hbase.client.HTablePool.findOrCreateTable(HTablePool.java:198)
        at org.apache.hadoop.hbase.client.HTablePool.getTable(HTablePool.java:173)
        at org.apache.hadoop.hbase.client.HTablePool.getTable(HTablePool.java:216)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSink.batch(ReplicationSink.java:171)

"IPC Server handler 3 on 60020" daemon prio=10 tid=0x00007f4ddcb1d800 nid=0x2db6 waiting on condition [0x00007f4dda7e6000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x000000056aa146e8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:386)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1348)

3) The region server containing the 'ROOT' also shows the following trace
with jstack

"RS_OPEN_REGION-ip-10-60-53-226.ec2.internal,60020,1354263663659-2" prio=10 tid=0x0000000001f07800 nid=0x575c waiting on condition [0x00007fc3333f2000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x000000056c2ceb70> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:386)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1043)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1103)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:679)

"RS_OPEN_REGION-ip-10-60-53-226.ec2.internal,60020,1354263663659-1" prio=10 tid=0x0000000002e9f000 nid=0x572d waiting on condition [0x00007fc3337f6000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x000000056c2ceb70> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:386)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1043)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1103)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:679)
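
For anyone wanting to check how many handlers are stuck in a given call, a saved jstack dump can be summarized with a small program. This is only a sketch and not part of HBase: the class name JstackSummary and both helper methods are made up for illustration, and it assumes the dump was saved to a text file in the format shown in the traces above.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for summarizing a saved jstack dump; not an HBase tool.
public class JstackSummary {

    // Returns thread state -> number of threads reporting that state.
    public static Map<String, Integer> countByState(List<String> dump) {
        final String marker = "java.lang.Thread.State:";
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : dump) {
            int idx = line.indexOf(marker);
            if (idx >= 0) {
                String state = line.substring(idx + marker.length()).trim();
                counts.merge(state, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Returns the names of threads whose stack mentions frameText,
    // for example "locateRegionInMeta". Each thread is listed once.
    public static List<String> threadsWithFrame(List<String> dump, String frameText) {
        List<String> names = new ArrayList<>();
        String current = null;   // name of the thread being scanned
        boolean matched = false; // whether current was already recorded
        for (String line : dump) {
            if (line.startsWith("\"")) {
                // A thread header starts with the quoted thread name.
                int end = line.indexOf('"', 1);
                current = end > 0 ? line.substring(1, end) : line;
                matched = false;
            } else if (current != null && !matched && line.contains(frameText)) {
                names.add(current);
                matched = true;
            }
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: JstackSummary <dump-file> <frame-text>");
            return;
        }
        List<String> dump = Files.readAllLines(Paths.get(args[0]));
        System.out.println("Threads by state: " + countByState(dump));
        for (String name : threadsWithFrame(dump, args[1])) {
            System.out.println("has frame '" + args[1] + "': " + name);
        }
    }
}
```

Run against a dump from one of the cluster B region servers with locateRegionInMeta as the second argument, this would list the handler threads that are currently retrying the META lookup.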

4) There are some replication-related exceptions, but I am not sure whether
those are critical.

2012-11-30 00:18:04,575 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: 1 Got: java.io.EOFException

Also,
2012-11-30 00:06:33,830 ERROR org.apache.hadoop.hbase.replication.regionserver.ReplicationSink: Unable to accept edit because:
java.net.SocketTimeoutException: Call to ip-10-10-54-176.ec2.internal/10.10.54.176:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 1500 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.60.53.226:34164 remote=ip-10-10-54-176.ec2.internal/10.10.54.176:60020]
        at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:949)
        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:922)
        at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:150)
        at $Proxy12.getClosestRowBefore(Unknown Source)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:965)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:832)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:807)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1042)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:836)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchCallback(HConnectionManager.java:1482)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1367)

At this point, when I restart region servers, they end up with 0 regions
and I am not able to bring back the regions they were serving. Any help
would be deeply appreciated.

Thanks
Varun
