Mailing-List: contact user-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hbase.apache.org
MIME-Version: 1.0
In-Reply-To: 
 <CAFKYj4dxWxf82SKMtwS=qz4_T1jEcn4AwLtY2fgy5aoLO=XA5w@mail.gmail.com>
References: 
 <CAFKYj4d5hKxL_6-pRyDdRr9=jjLby-3O0xRhnxg681NX8wuC6g@mail.gmail.com>
	<CAFKYj4dxWxf82SKMtwS=qz4_T1jEcn4AwLtY2fgy5aoLO=XA5w@mail.gmail.com>
Date: Wed, 27 Jan 2016 09:59:58 +0100
Message-ID: 
 <CADKW6+Bw=zbukkqMCbf60z1QWLZgoKnwJXdN1i222Uf7Lqw7mw@mail.gmail.com>
Subject: Re: All RegionServers stuck on BadVersion from ZK after cluster
 restart
From: Samir Ahmic <ahmic.samir@gmail.com>
To: user@hbase.apache.org
Content-Type: multipart/alternative; boundary=001a11c2f2daceacd9052a4d06de

--001a11c2f2daceacd9052a4d06de
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Hi,

This looks like some WALs splitting operations are not able to finish
causing master to abort. First thing i would try is to stop cluster, clear
zk data and then try to start it again. If that does not solve issue you
probably have case of corrupted WALs files. In that case you will have to
locate corrupted WALs files on hdfs and remove them in order to start
master.

Cheers
Samir

On Wed, Jan 27, 2016 at 7:26 AM, tsuna <tsunanet@gmail.com> wrote:

> BTW, I already left the cluster in this state for 24h, so I know it=E2=80=
=99s
> stuck for good.
>
> Since I just restarted everything about half an hour ago, I can tell you
> that I now have 31 regions online, and 278 stuck offline, and things aren=
=E2=80=99t
> making progress.  The master is just waiting for log splitting to happen,
> and it regularly dies and restarts because it can=E2=80=99t complete init=
ialization.
>
> In case that=E2=80=99s of any help, I=E2=80=99m attaching the last 1000 l=
ines of the
> master log, which show that the master is just waiting for stuff to happe=
n
> and then committing suicide and restarting and waiting again.
>
> 2016-01-27 06:21:30,871 INFO  [r12s3.sjc.aristanetworks.com,9102,14538743=
20326_splitLogManager__ChoreService_1]
> master.SplitLogManager: total tasks =3D 9 unassigned =3D 0
> tasks=3D{/hbase/splitWAL/WALs%2Fr12s6.sjc.aristanetworks.com
> %2C9104%2C1452811286154-splitting%2Fr12s6.sjc.aristanetworks.com%252C9104=
%252C1452811286154.default.1453728483550=3Dlast_update
> =3D 1453875668700 last_version =3D 30 cur_worker_name =3D
> r12s2.sjc.aristanetworks.com,9104,1453873919713 status =3D in_progress
> incarnation =3D 0 resubmits =3D 0 batch =3D installed =3D 1 done =3D 0 er=
ror =3D 0,
> /hbase/splitWAL/WALs%2Fr12s8.sjc.aristanetworks.com
> %2C9104%2C1452811287196-splitting%2Fr12s8.sjc.aristanetworks.com%252C9104=
%252C1452811287196.default.1453728125912=3Dlast_update
> =3D 1453875650945 last_version =3D 29 cur_worker_name =3D
> r12s13.sjc.aristanetworks.com,9104,1453873919678 status =3D in_progress
> incarnation =3D 0 resubmits =3D 0 batch =3D installed =3D 1 done =3D 0 er=
ror =3D 0,
> /hbase/splitWAL/WALs%2Fr12s14.sjc.aristanetworks.com
> %2C9104%2C1452811286288-splitting%2Fr12s14.sjc.aristanetworks.com%252C910=
4%252C1452811286288.default.1453728336644=3Dlast_update
> =3D 1453875673936 last_version =3D 30 cur_worker_name =3D
> r12s11.sjc.aristanetworks.com,9104,1453873919623 status =3D in_progress
> incarnation =3D 0 resubmits =3D 0 batch =3D installed =3D 1 done =3D 0 er=
ror =3D 0,
> /hbase/splitWAL/WALs%2Fr12s16.sjc.aristanetworks.com
> %2C9104%2C1452811286456-splitting%2Fr12s16.sjc.aristanetworks.com%252C910=
4%252C1452811286456.default.1453728374800=3Dlast_update
> =3D 1453875666968 last_version =3D 3593 cur_worker_name =3D
> r12s10.sjc.aristanetworks.com,9104,1453873919762 status =3D in_progress
> incarnation =3D 2 resubmits =3D 2 batch =3D installed =3D 1 done =3D 0 er=
ror =3D 0,
> /hbase/splitWAL/WALs%2Fr12s13.sjc.aristanetworks.com
> %2C9104%2C1452811287287-splitting%2Fr12s13.sjc.aristanetworks.com%252C910=
4%252C1452811287287.default.1453728621698=3Dlast_update
> =3D 1453875666969 last_version =3D 3650 cur_worker_name =3D
> r12s15.sjc.aristanetworks.com,9104,1453873919845 status =3D in_progress
> incarnation =3D 2 resubmits =3D 2 batch =3D null, /hbase/splitWAL/WALs%
> 2Fr12s4.sjc.aristanetworks.com%2C9104%2C1452811289547-splitting%
> 2Fr12s4.sjc.aristanetworks.com%252C9104%252C1452811289547.default.1453728=
949397=3Dlast_update
> =3D 1453875666969 last_version =3D 3658 cur_worker_name =3D
> r12s9.sjc.aristanetworks.com,9104,1453873919799 status =3D in_progress
> incarnation =3D 1 resubmits =3D 1 batch =3D installed =3D 1 done =3D 0 er=
ror =3D 0,
> /hbase/splitWAL/WALs%2Fr12s5.sjc.aristanetworks.com
> %2C9104%2C1452811125084-splitting%2Fr12s5.sjc.aristanetworks.com%252C9104=
%252C1452811125084.default.1453728624262=3Dlast_update
> =3D 1453875666968 last_version =3D 3437 cur_worker_name =3D
> r12s1.sjc.aristanetworks.com,9104,1453873919645 status =3D in_progress
> incarnation =3D 2 resubmits =3D 2 batch =3D null, /hbase/splitWAL/WALs%
> 2Fr12s1.sjc.aristanetworks.com%2C9104%2C1452811288618-splitting%
> 2Fr12s1.sjc.aristanetworks.com%252C9104%252C1452811288618.default.1453728=
791276=3Dlast_update
> =3D 1453875666968 last_version =3D 2280 cur_worker_name =3D
> r12s14.sjc.aristanetworks.com,9104,1453873919711 status =3D in_progress
> incarnation =3D 2 resubmits =3D 1 batch =3D null, /hbase/splitWAL/WALs%
> 2Fr12s2.sjc.aristanetworks.com%2C9104%2C1452811286448-splitting%
> 2Fr12s2.sjc.aristanetworks.com%252C9104%252C1452811286448.default.1453728=
137282=3Dlast_update
> =3D 1453875666968 last_version =3D 2219 cur_worker_name =3D
> r12s4.sjc.aristanetworks.com,9104,1453873919688 status =3D in_progress
> incarnation =3D 2 resubmits =3D 1 batch =3D null}
>
> On Tue, Jan 26, 2016 at 10:02 PM, tsuna <tsunanet@gmail.com> wrote:
>
>> Hi,
>> after a planned power outage one of our HBase clusters isn=E2=80=99t com=
ing back
>> up healthy.  The master shows the 16 region servers but zero regions.  A=
ll
>> the RegionServers are experiencing the same problem, which is that they=
=E2=80=99re
>> getting a BadVersion error from ZooKeeper.  This was with HBase 1.1.2
>> and I just upgraded all the nodes to 1.1.3 to see if this would make a
>> difference, but it didn=E2=80=99t.
>>
>> 2016-01-27 05:54:02,402 WARN  [RS_LOG_REPLAY_OPS-r12s4:9104-0]
>> coordination.ZkSplitLogWorkerCoordination: BADVERSION failed to assert
>> ownership for /hbase/splitWAL/WALs%2Fr12s16.sjc.aristanetworks.com
>> %2C9104%2C1452811286456-splitting%2Fr12s16.sjc.aristanetworks.com
>> %252C9104%252C1452811286456.default.1453728374800
>> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCod=
e
>> =3D BadVersion for /hbase/splitWAL/WALs%2Fr12s16.sjc.aristanetworks.com
>> %2C9104%2C1452811286456-splitting%2Fr12s16.sjc.aristanetworks.com
>> %252C9104%252C1452811286456.default.1453728374800
>> at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
>> at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>> at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1270)
>> at
>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(Recoverab=
leZooKeeper.java:429)
>> at
>> org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination.attemp=
tToOwnTask(ZkSplitLogWorkerCoordination.java:370)
>> at
>> org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$1.prog=
ress(ZkSplitLogWorkerCoordination.java:304)
>> at
>> org.apache.hadoop.hbase.util.FSHDFSUtils.checkIfCancelled(FSHDFSUtils.ja=
va:329)
>> at
>> org.apache.hadoop.hbase.util.FSHDFSUtils.recoverDFSFileLease(FSHDFSUtils=
.java:244)
>> at
>> org.apache.hadoop.hbase.util.FSHDFSUtils.recoverFileLease(FSHDFSUtils.ja=
va:162)
>> at org.apache.hadoop.hbase.wal.WALSplitter.getReader(WALSplitter.java:76=
1)
>> at
>> org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:29=
7)
>> at
>> org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:23=
5)
>> at
>> org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorke=
r.java:104)
>> at
>> org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(=
WALSplitterHandler.java:72)
>> at
>> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.jav=
a:1142)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.ja=
va:617)
>> at java.lang.Thread.run(Thread.java:745)
>> 2016-01-27 05:54:02,404 WARN  [RS_LOG_REPLAY_OPS-r12s4:9104-0]
>> coordination.ZkSplitLogWorkerCoordination: Failed to heartbeat the
>> task/hbase/splitWAL/WALs%2Fr12s16.sjc.aristanetworks.com
>> %2C9104%2C1452811286456-splitting%2Fr12s16.sjc.aristanetworks.com
>> %252C9104%252C1452811286456.default.1453728374800
>>
>> I=E2=80=99m attaching the full log of the RS this was extracted from, wh=
ich I
>> just restarted on 1.1.3, in case that=E2=80=99s of any help.
>>
>> I=E2=80=99ve never seen this before and after a bit of digging, I=E2=80=
=99m not really
>> going anywhere.  Any ideas / suggestions?
>>
>
> --
> Benoit "tsuna" Sigoure
>

--001a11c2f2daceacd9052a4d06de--