hbase-user mailing list archives

From Varun Sharma <va...@pinterest.com>
Subject Re: Region server deadlocks in master master replication
Date Fri, 30 Nov 2012 18:08:08 GMT
Hi Jean,

I looked at the release notes for 0.94.1 and 0.94.2 and it looks like all
the fixes there have to do with splitting of regions (I may be wrong). For
my cluster(s), splits are off.

Varun

On Fri, Nov 30, 2012 at 10:03 AM, Varun Sharma <varun@pinterest.com> wrote:

> Hi Jean,
>
> Thanks ! Could you point me to some of the fixes ? We currently use
> hbase-0.94.0 with some other patches.
>
> On Fri, Nov 30, 2012 at 8:53 AM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>
>> Use 0.94.2, it has all the fixes you need.
>>
>> J-D
>>
>> On Fri, Nov 30, 2012 at 4:56 AM, Varun Sharma <varun@pinterest.com> wrote:
>>
>> > After clearing out some files in /.logs which had size 0 and restarting
>> > the cluster, all regions came online and started serving. But now I am
>> > stuck again. The master moved some regions to rebalance after the restart
>> > and some of them are PENDING_CLOSE while 2 regions are offline. Again, all
>> > PRI handlers are stuck in replicateLogEntries(), according to the region
>> > server status page. Moreover, jstack shows that these are stuck in
>> > locateRegionInMeta. The other handlers are waiting as normal. Also, there
>> > are 0-byte files under /.logs again; I am not sure if these are causing
>> > the issues...
>> >
>> > Thanks !
>> >
>> > On Fri, Nov 30, 2012 at 3:46 AM, Varun Sharma <varun@pinterest.com> wrote:
>> >
>> > > Hi,
>> > >
>> > > I have a master-master replication setup with HBase 0.94.0. I only
>> > > write to cluster A, and replication carries the data over to cluster B.
>> > > I am having some really weird issues with cluster B. Basically, all the
>> > > priority RPC handlers are stuck in calls to replicateLogEntries, while
>> > > all the normal RPC handlers are just waiting on each region server.
>> > >
>> > > From the logs I could see the following:
>> > >
>> > > 1) Region server shutdown
>> > > Stopping the region server showed some issues. There were exceptions
>> > > thrown while closing down regions. The exceptions were in the
>> > > locateRegionInMeta calls and also while trying to get the value of
>> > > /hbase/root-region-server (I have checked via a manual client that
>> > > ZooKeeper is working fine).
>> > >
>> > > 2) jstack traces show that there are issues with locating the META and
>> > > the ROOT tables
>> > >
>> > > "PRI IPC Server handler 2 on 60020" daemon prio=10 tid=0x00007f4ddcd39000 nid=0x2dbf waiting on condition [0x00007f4dd9edc000]
>> > >    java.lang.Thread.State: TIMED_WAITING (sleeping)
>> > >         at java.lang.Thread.sleep(Native Method)
>> > >         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1046)
>> > >         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:836)
>> > >         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:801)
>> > >         at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:234)
>> > >         at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:174)
>> > >         at org.apache.hadoop.hbase.client.HTableFactory.createHTableInterface(HTableFactory.java:36)
>> > >         at org.apache.hadoop.hbase.client.HTablePool.createHTable(HTablePool.java:268)
>> > >         at org.apache.hadoop.hbase.client.HTablePool.findOrCreateTable(HTablePool.java:198)
>> > >         at org.apache.hadoop.hbase.client.HTablePool.getTable(HTablePool.java:173)
>> > >         at org.apache.hadoop.hbase.client.HTablePool.getTable(HTablePool.java:216)
>> > >         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSink.batch(ReplicationSink.java:171)
>> > >
>> > > "IPC Server handler 3 on 60020" daemon prio=10 tid=0x00007f4ddcb1d800 nid=0x2db6 waiting on condition [0x00007f4dda7e6000]
>> > >    java.lang.Thread.State: WAITING (parking)
>> > >         at sun.misc.Unsafe.park(Native Method)
>> > >         - parking to wait for  <0x000000056aa146e8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>> > >         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>> > >         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>> > >         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:386)
>> > >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1348)
>> > >
>> > > 3) The region server containing the 'ROOT' also shows the following
>> > > trace with jstack
>> > >
>> > > "RS_OPEN_REGION-ip-10-60-53-226.ec2.internal,60020,1354263663659-2" prio=10 tid=0x0000000001f07800 nid=0x575c waiting on condition [0x00007fc3333f2000]
>> > >    java.lang.Thread.State: WAITING (parking)
>> > >         at sun.misc.Unsafe.park(Native Method)
>> > >         - parking to wait for  <0x000000056c2ceb70> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>> > >         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>> > >         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>> > >         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:386)
>> > >         at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1043)
>> > >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1103)
>> > >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>> > >         at java.lang.Thread.run(Thread.java:679)
>> > >
>> > > "RS_OPEN_REGION-ip-10-60-53-226.ec2.internal,60020,1354263663659-1" prio=10 tid=0x0000000002e9f000 nid=0x572d waiting on condition [0x00007fc3337f6000]
>> > >    java.lang.Thread.State: WAITING (parking)
>> > >         at sun.misc.Unsafe.park(Native Method)
>> > >         - parking to wait for  <0x000000056c2ceb70> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>> > >         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>> > >         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>> > >         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:386)
>> > >         at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1043)
>> > >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1103)
>> > >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>> > >         at java.lang.Thread.run(Thread.java:679)
>> > >
>> > > 4) There are some replication-related exceptions, but I am not sure if
>> > > those are critical.
>> > >
>> > > 2012-11-30 00:18:04,575 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: 1 Got: java.io.EOFException
>> > >
>> > > Also,
>> > > 2012-11-30 00:06:33,830 ERROR org.apache.hadoop.hbase.replication.regionserver.ReplicationSink: Unable to accept edit because:
>> > > java.net.SocketTimeoutException: Call to ip-10-10-54-176.ec2.internal/10.10.54.176:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 1500 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.60.53.226:34164 remote=ip-10-10-54-176.ec2.internal/10.10.54.176:60020]
>> > >         at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:949)
>> > >         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:922)
>> > >         at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:150)
>> > >         at $Proxy12.getClosestRowBefore(Unknown Source)
>> > >         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:965)
>> > >         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:832)
>> > >         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:807)
>> > >         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1042)
>> > >         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:836)
>> > >         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchCallback(HConnectionManager.java:1482)
>> > >         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1367)
>> > >
>> > > At this point, when I restart region servers, they end up with 0 regions
>> > > and I am not able to bring back the regions they were serving. Any help
>> > > would be deeply appreciated.
>> > >
>> > > Thanks
>> > > Varun
>> > >
>> >
>>
>
>
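
Editorial note: the thread diagnoses the hang by reading jstack output and checking which handler threads sit in locateRegionInMeta. A minimal sketch of automating that check over a saved dump might look like the following. The function names and the shortened sample dump are assumptions made for illustration; they are not part of HBase or of the original messages, and the sketch assumes each thread header is on a single line, as jstack normally prints it.

```python
import re

# A thread block in a jstack dump starts with the quoted thread name.
THREAD_HEADER = re.compile(r'^"(?P<name>[^"]+)"')


def split_threads(dump):
    """Map each thread name to the non-empty lines of its stack block."""
    threads = {}
    current = None
    for raw_line in dump.splitlines():
        line = raw_line.strip()
        match = THREAD_HEADER.match(line)
        if match:
            current = match.group("name")
            threads[current] = []
        elif current is not None and line:
            threads[current].append(line)
    return threads


def handlers_in_method(dump, name_prefix, method):
    """Names of threads starting with name_prefix whose stack mentions method."""
    return sorted(
        name
        for name, frames in split_threads(dump).items()
        if name.startswith(name_prefix) and any(method in frame for frame in frames)
    )


# Shortened sample built from the two handler traces quoted in this thread.
SAMPLE_DUMP = '''
"PRI IPC Server handler 2 on 60020" daemon prio=10 tid=0x00007f4ddcd39000 nid=0x2dbf waiting on condition [0x00007f4dd9edc000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1046)

"IPC Server handler 3 on 60020" daemon prio=10 tid=0x00007f4ddcb1d800 nid=0x2db6 waiting on condition [0x00007f4dda7e6000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1348)
'''

if __name__ == "__main__":
    stuck = handlers_in_method(SAMPLE_DUMP, "PRI IPC Server handler", "locateRegionInMeta")
    print("Priority handlers in locateRegionInMeta:", stuck)
```

Running the same check against dumps taken a few seconds apart would show whether the same priority handlers stay in locateRegionInMeta, which is the pattern described in the messages above.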
