hbase-user mailing list archives

From tsuna <tsuna...@gmail.com>
Subject Re: All RegionServers stuck on BadVersion from ZK after cluster restart
Date Wed, 27 Jan 2016 06:26:59 GMT
BTW, I had already left the cluster in this state for 24h, so I know it’s
stuck for good.

I restarted everything about half an hour ago, and right now I have 31
regions online and 278 stuck offline, with no progress being made.  The
master is just waiting for log splitting to complete, and it regularly dies
and restarts because it can’t finish initialization.

In case that’s of any help, I’m attaching the last 1000 lines of the master
log, which show the master waiting for the split tasks to finish, committing
suicide, restarting, and waiting again.
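
If anyone wants to poke at the same state, here’s a minimal sketch of how
I’d dump those tasks straight out of ZooKeeper with the plain Java client.
The quorum address is a placeholder you’d adjust for your cluster; the
/hbase/splitWAL parent matches what our logs print:

import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Minimal sketch: list the splitWAL task znodes and their versions.
// "localhost:2181" is a placeholder; point it at your ZK quorum.
public class DumpSplitTasks {
  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> {});
    try {
      String parent = "/hbase/splitWAL";
      for (String task : zk.getChildren(parent, false)) {
        Stat stat = new Stat();
        byte[] data = zk.getData(parent + "/" + task, false, stat);
        // stat.getVersion() is what a worker must pass to setData() when it
        // heartbeats the task; a stale version is rejected as BadVersion.
        System.out.println(task + "  version=" + stat.getVersion()
            + "  dataLen=" + (data == null ? 0 : data.length));
      }
    } finally {
      zk.close();
    }
  }
}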

2016-01-27 06:21:30,871 INFO [r12s3.sjc.aristanetworks.com,9102,1453874320326_splitLogManager__ChoreService_1] master.SplitLogManager: total tasks = 9 unassigned = 0 tasks={
  /hbase/splitWAL/WALs%2Fr12s6.sjc.aristanetworks.com%2C9104%2C1452811286154-splitting%2Fr12s6.sjc.aristanetworks.com%252C9104%252C1452811286154.default.1453728483550=last_update = 1453875668700 last_version = 30 cur_worker_name = r12s2.sjc.aristanetworks.com,9104,1453873919713 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0,
  /hbase/splitWAL/WALs%2Fr12s8.sjc.aristanetworks.com%2C9104%2C1452811287196-splitting%2Fr12s8.sjc.aristanetworks.com%252C9104%252C1452811287196.default.1453728125912=last_update = 1453875650945 last_version = 29 cur_worker_name = r12s13.sjc.aristanetworks.com,9104,1453873919678 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0,
  /hbase/splitWAL/WALs%2Fr12s14.sjc.aristanetworks.com%2C9104%2C1452811286288-splitting%2Fr12s14.sjc.aristanetworks.com%252C9104%252C1452811286288.default.1453728336644=last_update = 1453875673936 last_version = 30 cur_worker_name = r12s11.sjc.aristanetworks.com,9104,1453873919623 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0,
  /hbase/splitWAL/WALs%2Fr12s16.sjc.aristanetworks.com%2C9104%2C1452811286456-splitting%2Fr12s16.sjc.aristanetworks.com%252C9104%252C1452811286456.default.1453728374800=last_update = 1453875666968 last_version = 3593 cur_worker_name = r12s10.sjc.aristanetworks.com,9104,1453873919762 status = in_progress incarnation = 2 resubmits = 2 batch = installed = 1 done = 0 error = 0,
  /hbase/splitWAL/WALs%2Fr12s13.sjc.aristanetworks.com%2C9104%2C1452811287287-splitting%2Fr12s13.sjc.aristanetworks.com%252C9104%252C1452811287287.default.1453728621698=last_update = 1453875666969 last_version = 3650 cur_worker_name = r12s15.sjc.aristanetworks.com,9104,1453873919845 status = in_progress incarnation = 2 resubmits = 2 batch = null,
  /hbase/splitWAL/WALs%2Fr12s4.sjc.aristanetworks.com%2C9104%2C1452811289547-splitting%2Fr12s4.sjc.aristanetworks.com%252C9104%252C1452811289547.default.1453728949397=last_update = 1453875666969 last_version = 3658 cur_worker_name = r12s9.sjc.aristanetworks.com,9104,1453873919799 status = in_progress incarnation = 1 resubmits = 1 batch = installed = 1 done = 0 error = 0,
  /hbase/splitWAL/WALs%2Fr12s5.sjc.aristanetworks.com%2C9104%2C1452811125084-splitting%2Fr12s5.sjc.aristanetworks.com%252C9104%252C1452811125084.default.1453728624262=last_update = 1453875666968 last_version = 3437 cur_worker_name = r12s1.sjc.aristanetworks.com,9104,1453873919645 status = in_progress incarnation = 2 resubmits = 2 batch = null,
  /hbase/splitWAL/WALs%2Fr12s1.sjc.aristanetworks.com%2C9104%2C1452811288618-splitting%2Fr12s1.sjc.aristanetworks.com%252C9104%252C1452811288618.default.1453728791276=last_update = 1453875666968 last_version = 2280 cur_worker_name = r12s14.sjc.aristanetworks.com,9104,1453873919711 status = in_progress incarnation = 2 resubmits = 1 batch = null,
  /hbase/splitWAL/WALs%2Fr12s2.sjc.aristanetworks.com%2C9104%2C1452811286448-splitting%2Fr12s2.sjc.aristanetworks.com%252C9104%252C1452811286448.default.1453728137282=last_update = 1453875666968 last_version = 2219 cur_worker_name = r12s4.sjc.aristanetworks.com,9104,1453873919688 status = in_progress incarnation = 2 resubmits = 1 batch = null}
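
(Side note, in case the %-escaping makes those task names painful to read:
each znode name looks like the WAL path URL-encoded once, with %2F for the
slashes; the %252C sequences decode to %2C, which as far as I can tell are
literal in the on-disk file names, since HBase escapes the commas of the
server name there.  A throwaway decoder, using the first task above as
input:)

import java.net.URLDecoder;

// Throwaway helper: decode a splitWAL task name back into the WAL path.
// The input below is the first task from the log above.
public class DecodeTaskName {
  public static void main(String[] args) throws Exception {
    String task = "WALs%2Fr12s6.sjc.aristanetworks.com%2C9104%2C1452811286154-splitting"
        + "%2Fr12s6.sjc.aristanetworks.com%252C9104%252C1452811286154.default.1453728483550";
    // One decode gives the path; the remaining %2C appear to be literal
    // in the file name (escaped commas from the server name).
    System.out.println(URLDecoder.decode(task, "UTF-8"));
    // -> WALs/r12s6.sjc.aristanetworks.com,9104,1452811286154-splitting/
    //    r12s6.sjc.aristanetworks.com%2C9104%2C1452811286154.default.1453728483550
  }
}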

On Tue, Jan 26, 2016 at 10:02 PM, tsuna <tsunanet@gmail.com> wrote:

> Hi,
> after a planned power outage one of our HBase clusters isn’t coming back
> up healthy.  The master shows the 16 region servers but zero regions.  All
> the RegionServers are experiencing the same problem, which is that they’re
> getting a BadVersion error from ZooKeeper.  This was with HBase 1.1.2 and
> I just upgraded all the nodes to 1.1.3 to see if this would make a
> difference, but it didn’t.
>
> 2016-01-27 05:54:02,402 WARN  [RS_LOG_REPLAY_OPS-r12s4:9104-0] coordination.ZkSplitLogWorkerCoordination: BADVERSION failed to assert ownership for /hbase/splitWAL/WALs%2Fr12s16.sjc.aristanetworks.com%2C9104%2C1452811286456-splitting%2Fr12s16.sjc.aristanetworks.com%252C9104%252C1452811286456.default.1453728374800
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/splitWAL/WALs%2Fr12s16.sjc.aristanetworks.com%2C9104%2C1452811286456-splitting%2Fr12s16.sjc.aristanetworks.com%252C9104%252C1452811286456.default.1453728374800
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1270)
> at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:429)
> at org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination.attemptToOwnTask(ZkSplitLogWorkerCoordination.java:370)
> at org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$1.progress(ZkSplitLogWorkerCoordination.java:304)
> at org.apache.hadoop.hbase.util.FSHDFSUtils.checkIfCancelled(FSHDFSUtils.java:329)
> at org.apache.hadoop.hbase.util.FSHDFSUtils.recoverDFSFileLease(FSHDFSUtils.java:244)
> at org.apache.hadoop.hbase.util.FSHDFSUtils.recoverFileLease(FSHDFSUtils.java:162)
> at org.apache.hadoop.hbase.wal.WALSplitter.getReader(WALSplitter.java:761)
> at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:297)
> at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:235)
> at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:104)
> at org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:72)
> at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 2016-01-27 05:54:02,404 WARN  [RS_LOG_REPLAY_OPS-r12s4:9104-0] coordination.ZkSplitLogWorkerCoordination: Failed to heartbeat the task/hbase/splitWAL/WALs%2Fr12s16.sjc.aristanetworks.com%2C9104%2C1452811286456-splitting%2Fr12s16.sjc.aristanetworks.com%252C9104%252C1452811286456.default.1453728374800
>
> I’m attaching the full log of the RS this was extracted from, which I just
> restarted on 1.1.3, in case that’s of any help.
>
> I’ve never seen this before, and after a bit of digging I’m not getting
> anywhere.  Any ideas / suggestions?
>
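
For anyone following along at home: the BADVERSION quoted above is
ZooKeeper’s optimistic concurrency check firing.  The split-log worker
heartbeats its task by calling setData() with the last znode version it saw;
if anyone else wrote the znode in between (say, the master resubmitting the
task), the version no longer matches, ZK rejects the write, and the worker
concludes it lost ownership.  A self-contained sketch of that failure mode —
the znode path and payloads are made up for the demo, and "localhost:2181"
is a placeholder:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Demonstrates the BadVersion failure mode: setData() succeeds only if the
// caller's expected version matches the znode's current version.
public class BadVersionDemo {
  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> {});
    try {
      String path = "/demo-task";  // hypothetical znode, not an HBase path
      zk.create(path, "owned".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE,
          CreateMode.PERSISTENT);
      Stat stat = new Stat();
      zk.getData(path, false, stat);
      int staleVersion = stat.getVersion();  // 0 right after create

      // Another writer (think: master resubmitting the task) updates the
      // znode first, bumping its version from 0 to 1.
      zk.setData(path, "resubmitted".getBytes(), staleVersion);

      // Our "heartbeat" still carries version 0 -> BadVersionException,
      // the same KeeperErrorCode the RegionServers are logging.
      try {
        zk.setData(path, "heartbeat".getBytes(), staleVersion);
      } catch (KeeperException.BadVersionException e) {
        System.out.println("lost ownership: " + e.getMessage());
      }
    } finally {
      zk.close();
    }
  }
}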

-- 
Benoit "tsuna" Sigoure
