hbase-user mailing list archives

From tsuna <tsuna...@gmail.com>
Subject All RegionServers stuck on BadVersion from ZK after cluster restart
Date Wed, 27 Jan 2016 06:02:35 GMT
Hi,
After a planned power outage, one of our HBase clusters isn't coming back
up healthy.  The master shows the 16 region servers but zero regions.  All
the RegionServers are hitting the same problem: they keep getting a
BadVersion error from ZooKeeper.  This was with HBase 1.1.2, and I just
upgraded all the nodes to 1.1.3 to see whether that would make a
difference, but it didn't.
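
For context, BadVersion is ZooKeeper's optimistic-concurrency failure:
setData() carries the znode version the caller last read, and ZooKeeper
rejects the write if the znode was modified in between.  Here's a minimal,
hypothetical sketch (plain ZooKeeper client, a throwaway /demo znode on
localhost:2181, nothing HBase-specific and not our actual code) that
triggers the same KeeperException$BadVersionException the RegionServers
log from ZkSplitLogWorkerCoordination.attemptToOwnTask():

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class BadVersionDemo {
  public static void main(String[] args) throws Exception {
    // Hypothetical local ensemble (real code would wait for the
    // connection event before issuing requests).
    ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });
    try {
      zk.create("/demo", "v0".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE,
                CreateMode.PERSISTENT);

      Stat stat = new Stat();
      zk.getData("/demo", false, stat);   // we read the znode at version 0

      // Another client updates the znode first; its version becomes 1.
      zk.setData("/demo", "theirs".getBytes(), stat.getVersion());

      // Re-using our stale version 0 now fails the compare-and-swap.
      zk.setData("/demo", "ours".getBytes(), stat.getVersion());
    } catch (KeeperException.BadVersionException e) {
      System.out.println("BadVersion: the znode changed underneath us");
    } finally {
      zk.close();
    }
  }
}

So presumably something keeps bumping the version of each splitWAL task
znode between the worker's read and its heartbeat, and every attempt to
re-assert ownership fails the same way.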

2016-01-27 05:54:02,402 WARN  [RS_LOG_REPLAY_OPS-r12s4:9104-0] coordination.ZkSplitLogWorkerCoordination: BADVERSION failed to assert ownership for /hbase/splitWAL/WALs%2Fr12s16.sjc.aristanetworks.com%2C9104%2C1452811286456-splitting%2Fr12s16.sjc.aristanetworks.com%252C9104%252C1452811286456.default.1453728374800
org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/splitWAL/WALs%2Fr12s16.sjc.aristanetworks.com%2C9104%2C1452811286456-splitting%2Fr12s16.sjc.aristanetworks.com%252C9104%252C1452811286456.default.1453728374800
at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1270)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:429)
at org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination.attemptToOwnTask(ZkSplitLogWorkerCoordination.java:370)
at org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$1.progress(ZkSplitLogWorkerCoordination.java:304)
at org.apache.hadoop.hbase.util.FSHDFSUtils.checkIfCancelled(FSHDFSUtils.java:329)
at org.apache.hadoop.hbase.util.FSHDFSUtils.recoverDFSFileLease(FSHDFSUtils.java:244)
at org.apache.hadoop.hbase.util.FSHDFSUtils.recoverFileLease(FSHDFSUtils.java:162)
at org.apache.hadoop.hbase.wal.WALSplitter.getReader(WALSplitter.java:761)
at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:297)
at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:235)
at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:104)
at org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:72)
at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2016-01-27 05:54:02,404 WARN  [RS_LOG_REPLAY_OPS-r12s4:9104-0] coordination.ZkSplitLogWorkerCoordination: Failed to heartbeat the task/hbase/splitWAL/WALs%2Fr12s16.sjc.aristanetworks.com%2C9104%2C1452811286456-splitting%2Fr12s16.sjc.aristanetworks.com%252C9104%252C1452811286456.default.1453728374800

I’m attaching the full log of the RegionServer this excerpt came from
(which I just restarted on 1.1.3), in case that’s of any help.

I’ve never seen this before, and after a bit of digging I’m not really
getting anywhere.  Any ideas / suggestions?

-- 
Benoit "tsuna" Sigoure
