hbase-user mailing list archives

From Gunnar Tapper <tapper.gun...@gmail.com>
Subject Re: Splitting causes HBase to crash
Date Fri, 13 May 2016 11:39:27 GMT
Hi Ted,

Two of the three logs just log successful connection attempts. The third
log shows:

2016-05-13 11:37:00,372 INFO
org.apache.zookeeper.server.PrepRequestProcessor: Got user-level
KeeperException when processing sessionid:0x154a8fa6dbc06eb type:setData
cxid:0x1be8 zxid:0xe0004d100 txntype:-1 reqpath:n/a Error
Path:/hbase/splitWAL/WALs%2Fip-172-31-50-109.ec2.internal%2C60020%2C1463123941361-splitting%2Fip-172-31-50-109.ec2.internal%252C60020%252C1463123941361.null0.1463123949342
Error:KeeperErrorCode = BadVersion for
/hbase/splitWAL/WALs%2Fip-172-31-50-109.ec2.internal%2C60020%2C1463123941361-splitting%2Fip-172-31-50-109.ec2.internal%252C60020%252C1463123941361.null0.1463123949342
2016-05-13 11:37:00,771 INFO
org.apache.zookeeper.server.PrepRequestProcessor: Got user-level
KeeperException when processing sessionid:0x354a8fa667e065e type:setData
cxid:0x14df zxid:0xe0004d101 txntype:-1 reqpath:n/a Error
Path:/hbase/splitWAL/WALs%2Fip-172-31-54-241.ec2.internal%2C60020%2C1463123941413-splitting%2Fip-172-31-54-241.ec2.internal%252C60020%252C1463123941413.null0.1463123949331
Error:KeeperErrorCode = BadVersion for
/hbase/splitWAL/WALs%2Fip-172-31-54-241.ec2.internal%2C60020%2C1463123941413-splitting%2Fip-172-31-54-241.ec2.internal%252C60020%252C1463123941413.null0.1463123949331
2016-05-13 11:37:01,067 INFO
org.apache.zookeeper.server.PrepRequestProcessor: Got user-level
KeeperException when processing sessionid:0x254a8fa65550672 type:setData
cxid:0x150e zxid:0xe0004d102 txntype:-1 reqpath:n/a Error
Path:/hbase/splitWAL/WALs%2Fip-172-31-54-241.ec2.internal%2C60020%2C1463123941413-splitting%2Fip-172-31-54-241.ec2.internal%252C60020%252C1463123941413.null0.1463123949331
Error:KeeperErrorCode = BadVersion for
/hbase/splitWAL/WALs%2Fip-172-31-54-241.ec2.internal%2C60020%2C1463123941413-splitting%2Fip-172-31-54-241.ec2.internal%252C60020%252C1463123941413.null0.1463123949331
2016-05-13 11:37:05,274 INFO
org.apache.zookeeper.server.PrepRequestProcessor: Got user-level
KeeperException when processing sessionid:0x254a8fa65550672 type:setData
cxid:0x150f zxid:0xe0004d103 txntype:-1 reqpath:n/a Error
Path:/hbase/splitWAL/WALs%2Fip-172-31-54-241.ec2.internal%2C60020%2C1463123941413-splitting%2Fip-172-31-54-241.ec2.internal%252C60020%252C1463123941413.null0.1463123949331
Error:KeeperErrorCode = BadVersion for
/hbase/splitWAL/WALs%2Fip-172-31-54-241.ec2.internal%2C60020%2C1463123941413-splitting%2Fip-172-31-54-241.ec2.internal%252C60020%252C1463123941413.null0.1463123949331
2016-05-13 11:37:13,550 INFO
org.apache.zookeeper.server.PrepRequestProcessor: Got user-level
KeeperException when processing sessionid:0x254a8fa65550672 type:setData
cxid:0x1513 zxid:0xe0004d107 txntype:-1 reqpath:n/a Error
Path:/hbase/splitWAL/WALs%2Fip-172-31-61-36.ec2.internal%2C60020%2C1463123940830-splitting%2Fip-172-31-61-36.ec2.internal%252C60020%252C1463123940830.null0.1463123949164
Error:KeeperErrorCode = BadVersion for
/hbase/splitWAL/WALs%2Fip-172-31-61-36.ec2.internal%2C60020%2C1463123940830-splitting%2Fip-172-31-61-36.ec2.internal%252C60020%252C1463123940830.null0.1463123949164
2016-05-13 11:37:13,550 INFO
org.apache.zookeeper.server.PrepRequestProcessor: Got user-level
KeeperException when processing sessionid:0x154a8fa6dbc06e9 type:setData
cxid:0xe09 zxid:0xe0004d109 txntype:-1 reqpath:n/a Error
Path:/hbase/splitWAL/WALs%2Fip-172-31-61-36.ec2.internal%2C60020%2C1463123940830-splitting%2Fip-172-31-61-36.ec2.internal%252C60020%252C1463123940830.null0.1463123949164
Error:KeeperErrorCode = BadVersion for
/hbase/splitWAL/WALs%2Fip-172-31-61-36.ec2.internal%2C60020%2C1463123940830-splitting%2Fip-172-31-61-36.ec2.internal%252C60020%252C1463123940830.null0.1463123949164
2016-05-13 11:37:13,552 INFO
org.apache.zookeeper.server.PrepRequestProcessor: Got user-level
KeeperException when processing sessionid:0x154a8fa6dbc06e9 type:setData
cxid:0xe0c zxid:0xe0004d10b txntype:-1 reqpath:n/a Error
Path:/hbase/splitWAL/WALs%2Fip-172-31-54-241.ec2.internal%2C60020%2C1463123941413-splitting%2Fip-172-31-54-241.ec2.internal%252C60020%252C1463123941413.null0.1463123949331
Error:KeeperErrorCode = BadVersion for
/hbase/splitWAL/WALs%2Fip-172-31-54-241.ec2.internal%2C60020%2C1463123941413-splitting%2Fip-172-31-54-241.ec2.internal%252C60020%252C1463123941413.null0.1463123949331
2016-05-13 11:37:20,707 INFO
org.apache.zookeeper.server.PrepRequestProcessor: Got user-level
KeeperException when processing sessionid:0x154a8fa6dbc06eb type:setData
cxid:0x22c7 zxid:0xe0004d111 txntype:-1 reqpath:n/a Error
Path:/hbase/splitWAL/WALs%2Fip-172-31-50-109.ec2.internal%2C60020%2C1463123941361-splitting%2Fip-172-31-50-109.ec2.internal%252C60020%252C1463123941361.null0.1463123949342
Error:KeeperErrorCode = BadVersion for
/hbase/splitWAL/WALs%2Fip-172-31-50-109.ec2.internal%2C60020%2C1463123941361-splitting%2Fip-172-31-50-109.ec2.internal%252C60020%252C1463123941361.null0.1463123949342
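
All of the entries above have the same shape: a setData on a /hbase/splitWAL/... task znode rejected with BadVersion. ZooKeeper returns BadVersion when a setData carries an expected version that no longer matches the znode's current version. My reading, which the logs alone do not prove, is that these INFO lines show split-log workers (or the master) contending for the same WAL-splitting task rather than a quorum failure. The following is a minimal Python sketch for tallying such rejections per region server whose WALs are being split, assuming the ZooKeeper log is available as text in exactly the format shown. The function name summarize_bad_version is hypothetical:

```python
import re
from collections import defaultdict
from urllib.parse import unquote

# One PrepRequestProcessor entry for a setData rejected with BadVersion.
# \s+ also matches the newlines introduced when the log lines were wrapped.
BAD_VERSION = re.compile(
    r"sessionid:(0x[0-9a-f]+)\s+type:setData\s+cxid:\S+\s+zxid:\S+\s+"
    r"txntype:-1\s+reqpath:n/a\s+Error\s+Path:(\S+)\s+"
    r"Error:KeeperErrorCode = BadVersion"
)
PREFIX = "/hbase/splitWAL/"

def summarize_bad_version(log_text):
    """Group BadVersion rejections on split-log task znodes by the region
    server whose WALs are being split (hypothetical helper)."""
    summary = defaultdict(lambda: {"rejections": 0, "sessions": set()})
    for session, znode in BAD_VERSION.findall(log_text):
        if not znode.startswith(PREFIX):
            continue
        # The task name is a URL-encoded WAL path; one decode yields
        # WALs/<host>,<port>,<startcode>-splitting/<wal file>
        wal_path = unquote(znode[len(PREFIX):])
        server = wal_path.split("/")[1].split(",")[0]
        summary[server]["rejections"] += 1
        summary[server]["sessions"].add(session)
    return dict(summary)
```

For the eight entries quoted above, this should report 4 rejections for ip-172-31-54-241.ec2.internal coming from three different ZooKeeper sessions, and 2 each for ip-172-31-50-109 and ip-172-31-61-36. Several sessions touching the same task is consistent with (but not proof of) multiple workers racing for it.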

Thanks,

Gunnar

On Fri, May 13, 2016 at 3:24 AM, Ted Yu <yuzhihong@gmail.com> wrote:

> bq. Unable to list children of znode /hbase/region-in-transition
>
> Looks like there might be some problem with zookeeper quorum.
>
> Can you check the ZooKeeper server logs?
>
> Cheers
>
> On Fri, May 13, 2016 at 12:17 AM, Gunnar Tapper <tapper.gunnar@gmail.com>
> wrote:
>
> > Hi,
> >
> > I'm doing some development testing with Apache Trafodion running
> > HBase Version 1.0.0-cdh5.4.5.
> >
> > All of a sudden, HBase has started to crash. First, it could not be
> > recovered until I changed hbase_master_distributed_log_splitting to false.
> > At that point, HBase restarted and sat happily idling for 1 hour. Then, I
> > started Trafodion, letting it sit idle for 1 hour.
> >
> > I then started a workload and all RegionServers came crashing down. Looking
> > at the log files, I suspected ZooKeeper issues, so I restarted ZooKeeper and
> > then HBase. Now, the HMaster fails with:
> >
> > 2016-05-13 07:13:52,521 INFO org.apache.hadoop.hbase.master.RegionStates:
> > Transition {a33adb83f77095913adb4701b01c09a0 state=PENDING_OPEN,
> > ts=1463123333157, server=ip-172-31-50-109.ec2.internal,60020,1463122925684}
> > to {a33adb83f77095913adb4701b01c09a0 state=OPENING, ts=1463123632517,
> > server=ip-172-31-50-109.ec2.internal,60020,1463122925684}
> > 2016-05-13 07:13:52,527 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil:
> > master:60000-0x354a8eaea3e007d,
> > quorum=ip-172-31-53-252.ec2.internal:2181,ip-172-31-54-241.ec2.internal:2181,ip-172-31-61-36.ec2.internal:2181,
> > baseZNode=/hbase Unable to list children of znode
> > /hbase/region-in-transition
> > java.lang.InterruptedException
> > at java.lang.Object.wait(Native Method)
> > at java.lang.Object.wait(Object.java:503)
> > at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1342)
> > at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1466)
> > at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getChildren(RecoverableZooKeeper.java:296)
> > at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenAndWatchForNewChildren(ZKUtil.java:518)
> > at org.apache.hadoop.hbase.master.AssignmentManager$5.run(AssignmentManager.java:1420)
> > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> > at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> > at java.lang.Thread.run(Thread.java:745)
> > 2016-05-13 07:13:52,527 INFO
> > org.apache.hadoop.hbase.procedure.flush.MasterFlushTableProcedureManager:
> > stop: server shutting down.
> > 2016-05-13 07:13:52,527 INFO org.apache.hadoop.hbase.ipc.RpcServer:
> > Stopping server on 60000
> > 2016-05-13 07:13:52,527 INFO org.apache.hadoop.hbase.ipc.RpcServer:
> > RpcServer.listener,port=60000: stopping
> > 2016-05-13 07:13:52,528 INFO org.apache.hadoop.hbase.ipc.RpcServer:
> > RpcServer.responder: stopped
> > 2016-05-13 07:13:52,528 INFO org.apache.hadoop.hbase.ipc.RpcServer:
> > RpcServer.responder: stopping
> > 2016-05-13 07:13:52,532 ERROR org.apache.zookeeper.ClientCnxn: Error while
> > calling watcher
> > java.util.concurrent.RejectedExecutionException: Task
> > java.util.concurrent.FutureTask@33d4a2bd rejected from
> > java.util.concurrent.ThreadPoolExecutor@4d0840e0[Terminated, pool size = 0,
> > active threads = 0, queued tasks = 0, completed tasks = 38681]
> > at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048)
> > at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
> > at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
> > at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:110)
> > at org.apache.hadoop.hbase.master.AssignmentManager.zkEventWorkersSubmit(AssignmentManager.java:1285)
> > at org.apache.hadoop.hbase.master.AssignmentManager.handleAssignmentEvent(AssignmentManager.java:1479)
> > at org.apache.hadoop.hbase.master.AssignmentManager.nodeDataChanged(AssignmentManager.java:1244)
> > at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:458)
> > at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
> > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> > 2016-05-13 07:13:52,533 INFO
> > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Node
> > /hbase/rs/ip-172-31-50-109.ec2.internal,60000,1463122925543 already
> > deleted, retry=false
> > 2016-05-13 07:13:52,534 INFO org.apache.zookeeper.ZooKeeper: Session:
> > 0x354a8eaea3e007d closed
> > 2016-05-13 07:13:52,534 INFO
> > org.apache.hadoop.hbase.regionserver.HRegionServer: stopping server
> > ip-172-31-50-109.ec2.internal,60000,1463122925543; zookeeper connection
> > closed.
> > 2016-05-13 07:13:52,534 INFO
> > org.apache.hadoop.hbase.regionserver.HRegionServer:
> > master/ip-172-31-50-109.ec2.internal/172.31.50.109:60000 exiting
> > 2016-05-13 07:13:52,534 INFO org.apache.zookeeper.ClientCnxn: EventThread
> > shut down
> >
> > Suggestions on how to move forward so that I can recover this system?
> >
> > --
> > Thanks,
> >
> > Gunnar
> > *If you think you can you can, if you think you can't you're right.*
> >
>
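
Ted's suggestion that the ZooKeeper quorum may be unhealthy can be checked without going through HBase. ZooKeeper 3.4 servers answer plain-text "four-letter word" commands on the client port, and the srvr command replies with lines such as "Mode: leader" or "Mode: follower". A healthy three-node ensemble, like the quorum= list in the HMaster log, should show exactly one leader and at least a majority (2 of 3) of servers serving. This is a minimal Python sketch, assuming srvr is not disabled on these servers and port 2181 is reachable from where the script runs; the helper names are hypothetical:

```python
import socket

def four_letter_word(host, port=2181, cmd=b"srvr", timeout=5.0):
    """Send a ZooKeeper four-letter command and return the whole reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")

def parse_srvr(reply):
    """Turn the 'Key: value' lines of a srvr reply into a dict."""
    fields = {}
    for line in reply.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def quorum_status(modes, ensemble_size):
    """modes: 'leader', 'follower', or None (unreachable / not serving)."""
    serving = [m for m in modes if m in ("leader", "follower")]
    majority = ensemble_size // 2 + 1
    return serving.count("leader") == 1 and len(serving) >= majority

def check_ensemble(hosts, port=2181):
    """Query every host, print its mode, and report overall quorum health."""
    modes = []
    for host in hosts:
        try:
            # A server that is up but not serving replies without a Mode line.
            mode = parse_srvr(four_letter_word(host, port)).get("Mode")
        except OSError as exc:  # DNS failure, refused connection, timeout
            print(f"{host}: unreachable ({exc})")
            mode = None
        else:
            print(f"{host}: Mode={mode}")
        modes.append(mode)
    return quorum_status(modes, len(hosts))
```

Usage would be check_ensemble(["ip-172-31-53-252.ec2.internal", "ip-172-31-54-241.ec2.internal", "ip-172-31-61-36.ec2.internal"]). Running it from more than one host is worthwhile, since a server that answers locally may still be cut off from the rest of the ensemble.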


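The first workaround in the thread was setting hbase_master_distributed_log_splitting to false. That name looks like a Cloudera Manager setting. The assumption here is that it maps to the hbase.master.distributed.log.splitting property in the hbase-site.xml the HMaster actually reads, which to my knowledge defaults to true in HBase 1.x. Because Cloudera Manager regenerates configuration files on restart, confirming the value the running master sees can be useful. The following is a minimal sketch that reads a Hadoop-style configuration file and reports that property, treating a missing or unrecognized value as the default. The helper names and the file location are assumptions:

```python
import xml.etree.ElementTree as ET

KEY = "hbase.master.distributed.log.splitting"

def read_hadoop_conf(xml_data):
    """Parse <configuration><property><name/><value/></property>... into a dict.
    Accepts str or bytes; variable substitution and 'final' are ignored."""
    root = ET.fromstring(xml_data)
    conf = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is not None:
            conf[name.strip()] = (prop.findtext("value") or "").strip()
    return conf

def distributed_log_splitting_enabled(conf, default=True):
    """Mirror a boolean lookup: 'true'/'false' (any case), else the default."""
    value = conf.get(KEY, "").strip().lower()
    if value == "true":
        return True
    if value == "false":
        return False
    return default  # assumption: absent means the HBase 1.x default (true)
```

For example, with open("hbase-site.xml", "rb") as f: print(distributed_log_splitting_enabled(read_hadoop_conf(f.read()))), pointed at whichever hbase-site.xml the master process was started with.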

-- 
Thanks,

Gunnar
*If you think you can you can, if you think you can't you're right.*
