hbase-user mailing list archives

From Samir Ahmic <ahmic.sa...@gmail.com>
Subject Re: All RegionServers stuck on BadVersion from ZK after cluster restart
Date Wed, 27 Jan 2016 08:59:58 GMT
Hi,

This looks like some WAL-splitting operations are unable to finish, which is
causing the master to abort. The first thing I would try is to stop the
cluster, clear the ZK data, and then start it again. If that does not solve
the issue, you probably have a case of corrupted WAL files. In that case you
will have to locate the corrupted WAL files on HDFS and remove them in order
to get the master started.
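Roughly something like this, assuming the default znode parent (/hbase), the
default WAL layout under hbase.rootdir, and the stock start/stop scripts; the
sideline directory below is just an example, so adjust paths to your deployment:

  # 1) Stop HBase and wipe its znodes; they are rebuilt on the next start,
  #    but any in-flight state (including the stuck splitWAL tasks) is dropped.
  $HBASE_HOME/bin/stop-hbase.sh
  $HBASE_HOME/bin/hbase zkcli rmr /hbase
  $HBASE_HOME/bin/start-hbase.sh

  # 2) If the master still cannot finish splitting, find the WALs named in the
  #    master log under the *-splitting directories:
  hdfs dfs -ls '/hbase/WALs/*-splitting/'

  # Optionally try to read a suspect file; a corrupted WAL will usually make
  # the pretty printer fail partway through:
  hbase org.apache.hadoop.hbase.wal.WALPrettyPrinter <path-to-wal>

  # 3) Sideline (or delete) the corrupted WAL so splitting can complete.
  #    Moving it aside is safer than deleting, though its edits are lost either way.
  hdfs dfs -mkdir -p /hbase/corrupt-wals
  hdfs dfs -mv /hbase/WALs/<server>-splitting/<corrupted-wal> /hbase/corrupt-wals/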

Cheers
Samir

On Wed, Jan 27, 2016 at 7:26 AM, tsuna <tsunanet@gmail.com> wrote:

> BTW, I already left the cluster in this state for 24h, so I know it’s
> stuck for good.
>
> Since I just restarted everything about half an hour ago, I can tell you
> that I now have 31 regions online, and 278 stuck offline, and things aren’t
> making progress.  The master is just waiting for log splitting to happen,
> and it regularly dies and restarts because it can’t complete initialization.
>
> In case that’s of any help, I’m attaching the last 1000 lines of the
> master log, which show that the master is just waiting for stuff to happen
> and then committing suicide and restarting and waiting again.
>
> 2016-01-27 06:21:30,871 INFO  [r12s3.sjc.aristanetworks.com,9102,1453874320326_splitLogManager__ChoreService_1]
> master.SplitLogManager: total tasks = 9 unassigned = 0
> tasks={/hbase/splitWAL/WALs%2Fr12s6.sjc.aristanetworks.com
> %2C9104%2C1452811286154-splitting%2Fr12s6.sjc.aristanetworks.com%252C9104%252C1452811286154.default.1453728483550=last_update
> = 1453875668700 last_version = 30 cur_worker_name =
> r12s2.sjc.aristanetworks.com,9104,1453873919713 status = in_progress
> incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0,
> /hbase/splitWAL/WALs%2Fr12s8.sjc.aristanetworks.com
> %2C9104%2C1452811287196-splitting%2Fr12s8.sjc.aristanetworks.com%252C9104%252C1452811287196.default.1453728125912=last_update
> = 1453875650945 last_version = 29 cur_worker_name =
> r12s13.sjc.aristanetworks.com,9104,1453873919678 status = in_progress
> incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0,
> /hbase/splitWAL/WALs%2Fr12s14.sjc.aristanetworks.com
> %2C9104%2C1452811286288-splitting%2Fr12s14.sjc.aristanetworks.com%252C9104%252C1452811286288.default.1453728336644=last_update
> = 1453875673936 last_version = 30 cur_worker_name =
> r12s11.sjc.aristanetworks.com,9104,1453873919623 status = in_progress
> incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0,
> /hbase/splitWAL/WALs%2Fr12s16.sjc.aristanetworks.com
> %2C9104%2C1452811286456-splitting%2Fr12s16.sjc.aristanetworks.com%252C9104%252C1452811286456.default.1453728374800=last_update
> = 1453875666968 last_version = 3593 cur_worker_name =
> r12s10.sjc.aristanetworks.com,9104,1453873919762 status = in_progress
> incarnation = 2 resubmits = 2 batch = installed = 1 done = 0 error = 0,
> /hbase/splitWAL/WALs%2Fr12s13.sjc.aristanetworks.com
> %2C9104%2C1452811287287-splitting%2Fr12s13.sjc.aristanetworks.com%252C9104%252C1452811287287.default.1453728621698=last_update
> = 1453875666969 last_version = 3650 cur_worker_name =
> r12s15.sjc.aristanetworks.com,9104,1453873919845 status = in_progress
> incarnation = 2 resubmits = 2 batch = null, /hbase/splitWAL/WALs%
> 2Fr12s4.sjc.aristanetworks.com%2C9104%2C1452811289547-splitting%
> 2Fr12s4.sjc.aristanetworks.com%252C9104%252C1452811289547.default.1453728949397=last_update
> = 1453875666969 last_version = 3658 cur_worker_name =
> r12s9.sjc.aristanetworks.com,9104,1453873919799 status = in_progress
> incarnation = 1 resubmits = 1 batch = installed = 1 done = 0 error = 0,
> /hbase/splitWAL/WALs%2Fr12s5.sjc.aristanetworks.com
> %2C9104%2C1452811125084-splitting%2Fr12s5.sjc.aristanetworks.com%252C9104%252C1452811125084.default.1453728624262=last_update
> = 1453875666968 last_version = 3437 cur_worker_name =
> r12s1.sjc.aristanetworks.com,9104,1453873919645 status = in_progress
> incarnation = 2 resubmits = 2 batch = null, /hbase/splitWAL/WALs%
> 2Fr12s1.sjc.aristanetworks.com%2C9104%2C1452811288618-splitting%
> 2Fr12s1.sjc.aristanetworks.com%252C9104%252C1452811288618.default.1453728791276=last_update
> = 1453875666968 last_version = 2280 cur_worker_name =
> r12s14.sjc.aristanetworks.com,9104,1453873919711 status = in_progress
> incarnation = 2 resubmits = 1 batch = null, /hbase/splitWAL/WALs%
> 2Fr12s2.sjc.aristanetworks.com%2C9104%2C1452811286448-splitting%
> 2Fr12s2.sjc.aristanetworks.com%252C9104%252C1452811286448.default.1453728137282=last_update
> = 1453875666968 last_version = 2219 cur_worker_name =
> r12s4.sjc.aristanetworks.com,9104,1453873919688 status = in_progress
> incarnation = 2 resubmits = 1 batch = null}
>
> On Tue, Jan 26, 2016 at 10:02 PM, tsuna <tsunanet@gmail.com> wrote:
>
>> Hi,
>> after a planned power outage one of our HBase clusters isn’t coming back
>> up healthy.  The master shows the 16 region servers but zero regions.  All
>> the RegionServers are experiencing the same problem, which is that they’re
>> getting a BadVersion error from ZooKeeper.  This was with HBase 1.1.2
>> and I just upgraded all the nodes to 1.1.3 to see if this would make a
>> difference, but it didn’t.
>>
>> 2016-01-27 05:54:02,402 WARN  [RS_LOG_REPLAY_OPS-r12s4:9104-0]
>> coordination.ZkSplitLogWorkerCoordination: BADVERSION failed to assert
>> ownership for /hbase/splitWAL/WALs%2Fr12s16.sjc.aristanetworks.com
>> %2C9104%2C1452811286456-splitting%2Fr12s16.sjc.aristanetworks.com
>> %252C9104%252C1452811286456.default.1453728374800
>> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode
>> = BadVersion for /hbase/splitWAL/WALs%2Fr12s16.sjc.aristanetworks.com
>> %2C9104%2C1452811286456-splitting%2Fr12s16.sjc.aristanetworks.com
>> %252C9104%252C1452811286456.default.1453728374800
>> at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
>> at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>> at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1270)
>> at
>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:429)
>> at
>> org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination.attemptToOwnTask(ZkSplitLogWorkerCoordination.java:370)
>> at
>> org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$1.progress(ZkSplitLogWorkerCoordination.java:304)
>> at
>> org.apache.hadoop.hbase.util.FSHDFSUtils.checkIfCancelled(FSHDFSUtils.java:329)
>> at
>> org.apache.hadoop.hbase.util.FSHDFSUtils.recoverDFSFileLease(FSHDFSUtils.java:244)
>> at
>> org.apache.hadoop.hbase.util.FSHDFSUtils.recoverFileLease(FSHDFSUtils.java:162)
>> at org.apache.hadoop.hbase.wal.WALSplitter.getReader(WALSplitter.java:761)
>> at
>> org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:297)
>> at
>> org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:235)
>> at
>> org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:104)
>> at
>> org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:72)
>> at
>> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>> 2016-01-27 05:54:02,404 WARN  [RS_LOG_REPLAY_OPS-r12s4:9104-0]
>> coordination.ZkSplitLogWorkerCoordination: Failed to heartbeat the
>> task/hbase/splitWAL/WALs%2Fr12s16.sjc.aristanetworks.com
>> %2C9104%2C1452811286456-splitting%2Fr12s16.sjc.aristanetworks.com
>> %252C9104%252C1452811286456.default.1453728374800
>>
>> I’m attaching the full log of the RS this was extracted from, which I
>> just restarted on 1.1.3, in case that’s of any help.
>>
>> I’ve never seen this before and after a bit of digging, I’m not really
>> going anywhere.  Any ideas / suggestions?
>>
>
> --
> Benoit "tsuna" Sigoure
>
