Return-Path: X-Original-To: apmail-hbase-user-archive@www.apache.org Delivered-To: apmail-hbase-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id EF42018A42 for ; Wed, 27 Jan 2016 09:00:13 +0000 (UTC) Received: (qmail 47247 invoked by uid 500); 27 Jan 2016 09:00:11 -0000 Delivered-To: apmail-hbase-user-archive@hbase.apache.org Received: (qmail 47174 invoked by uid 500); 27 Jan 2016 09:00:11 -0000 Mailing-List: contact user-help@hbase.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hbase.apache.org Delivered-To: mailing list user@hbase.apache.org Received: (qmail 47162 invoked by uid 99); 27 Jan 2016 09:00:11 -0000 Received: from Unknown (HELO spamd1-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 27 Jan 2016 09:00:11 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd1-us-west.apache.org (ASF Mail Server at spamd1-us-west.apache.org) with ESMTP id 1811EC2803 for ; Wed, 27 Jan 2016 09:00:11 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd1-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 2.9 X-Spam-Level: ** X-Spam-Status: No, score=2.9 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, HTML_MESSAGE=3, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=disabled Authentication-Results: spamd1-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mx1-us-west.apache.org ([10.40.0.8]) by localhost (spamd1-us-west.apache.org [10.40.0.7]) (amavisd-new, port 10024) with ESMTP id RkxV7BWUfIee for ; Wed, 27 Jan 2016 08:59:59 +0000 (UTC) Received: from mail-ob0-f179.google.com (mail-ob0-f179.google.com [209.85.214.179]) by mx1-us-west.apache.org (ASF Mail Server at mx1-us-west.apache.org) with ESMTPS id 4BEBC20426 for ; Wed, 27 Jan 2016 08:59:59 +0000 (UTC) Received: by mail-ob0-f179.google.com with SMTP id zv1so1831086obb.2 for ; Wed, 27 Jan 2016 00:59:59 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; bh=j1UUBEgOpDoRgRNv512Pgiw+G4s0Vz0Phzq2b7Esd74=; b=PSTZrM3KHXdaJ4j44W07rzNskAqg1Vj147qaHF1ELssphiQV9us4Tot2VjYgYWcNQA XHLNXf6crV/XhmeBkvUW6wxdtdZylxg7D8XyMzvIqrRrwGQtx0lfmaM9EfxT7lIb50TH vXmsmv1T8OIzTgbW2iOeDnBEm/q/1sRvB7yLKcgTOc4Nefj/vKmESAaAloQ7ALfonwk1 KH6G+INL66ijxvU5nfr7SBJhbI2sWmOi6hVeGGVTlzmwpbGdXoHsSEVD2jtYbSSatlAb BoikQsB6hdGkXKm5svnlXt+JuKQuH2SBdipKL6lVVJ25tn0NGI12fDJqM6x20/EIag1k GM3w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:date :message-id:subject:from:to:content-type; bh=j1UUBEgOpDoRgRNv512Pgiw+G4s0Vz0Phzq2b7Esd74=; b=f3NZjWyL9LtHspKktKy2H4Vjp/pKO8eBXrljaMm9sE6LA5FYuE1H91jEvbeoMvs/En /LnS1/yTVOQ16RtauL0d7d85ugTSHkkgALvUSqDrm5kNRleDQkcZQiMh3E4jWbIAlmkQ OSfKhQpJ10iXeuCgz9uLXmeTfCz9dGnHfXDa7Dy6rr6j+xNRjcSmTCcjgmI61/7oJEPL NqxUl9JN2LqJT2OobzEt6fsmXjAcZ1QRYxx8RrW+8y7WiWzhXgCJEbk7x6aNivjfbj/1 4uzRO9Grr8lNbu8BnT1VH+K59OrKXdhoFX28kOau1Z15il//j0YHUcGTv2JU+jaUOirF v8TQ== X-Gm-Message-State: AG10YORs8h82CBe9INtqJTIpEXecZGRV0Qbti5mkzWfJ5ysQKHDQW8Wc7+J97JctLRsKZ+/iWmlO2fw80/8t+w== MIME-Version: 1.0 X-Received: by 10.182.72.129 with SMTP id d1mr22423615obv.4.1453885198607; Wed, 27 Jan 2016 00:59:58 -0800 (PST) Received: by 10.76.8.10 with HTTP; Wed, 27 Jan 2016 00:59:58 -0800 (PST) In-Reply-To: References: Date: Wed, 27 Jan 2016 09:59:58 +0100 Message-ID: Subject: Re: All RegionServers stuck on BadVersion from ZK after cluster restart From: Samir Ahmic To: user@hbase.apache.org Content-Type: multipart/alternative; boundary=001a11c2f2daceacd9052a4d06de --001a11c2f2daceacd9052a4d06de Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Hi, This looks like some WALs splitting operations are not able to finish causing master to abort. First thing i would try is to stop cluster, clear zk data and then try to start it again. If that does not solve issue you probably have case of corrupted WALs files. In that case you will have to locate corrupted WALs files on hdfs and remove them in order to start master. Cheers Samir On Wed, Jan 27, 2016 at 7:26 AM, tsuna wrote: > BTW, I already left the cluster in this state for 24h, so I know it=E2=80= =99s > stuck for good. > > Since I just restarted everything about half an hour ago, I can tell you > that I now have 31 regions online, and 278 stuck offline, and things aren= =E2=80=99t > making progress. The master is just waiting for log splitting to happen, > and it regularly dies and restarts because it can=E2=80=99t complete init= ialization. > > In case that=E2=80=99s of any help, I=E2=80=99m attaching the last 1000 l= ines of the > master log, which show that the master is just waiting for stuff to happe= n > and then committing suicide and restarting and waiting again. > > 2016-01-27 06:21:30,871 INFO [r12s3.sjc.aristanetworks.com,9102,14538743= 20326_splitLogManager__ChoreService_1] > master.SplitLogManager: total tasks =3D 9 unassigned =3D 0 > tasks=3D{/hbase/splitWAL/WALs%2Fr12s6.sjc.aristanetworks.com > %2C9104%2C1452811286154-splitting%2Fr12s6.sjc.aristanetworks.com%252C9104= %252C1452811286154.default.1453728483550=3Dlast_update > =3D 1453875668700 last_version =3D 30 cur_worker_name =3D > r12s2.sjc.aristanetworks.com,9104,1453873919713 status =3D in_progress > incarnation =3D 0 resubmits =3D 0 batch =3D installed =3D 1 done =3D 0 er= ror =3D 0, > /hbase/splitWAL/WALs%2Fr12s8.sjc.aristanetworks.com > %2C9104%2C1452811287196-splitting%2Fr12s8.sjc.aristanetworks.com%252C9104= %252C1452811287196.default.1453728125912=3Dlast_update > =3D 1453875650945 last_version =3D 29 cur_worker_name =3D > r12s13.sjc.aristanetworks.com,9104,1453873919678 status =3D in_progress > incarnation =3D 0 resubmits =3D 0 batch =3D installed =3D 1 done =3D 0 er= ror =3D 0, > /hbase/splitWAL/WALs%2Fr12s14.sjc.aristanetworks.com > %2C9104%2C1452811286288-splitting%2Fr12s14.sjc.aristanetworks.com%252C910= 4%252C1452811286288.default.1453728336644=3Dlast_update > =3D 1453875673936 last_version =3D 30 cur_worker_name =3D > r12s11.sjc.aristanetworks.com,9104,1453873919623 status =3D in_progress > incarnation =3D 0 resubmits =3D 0 batch =3D installed =3D 1 done =3D 0 er= ror =3D 0, > /hbase/splitWAL/WALs%2Fr12s16.sjc.aristanetworks.com > %2C9104%2C1452811286456-splitting%2Fr12s16.sjc.aristanetworks.com%252C910= 4%252C1452811286456.default.1453728374800=3Dlast_update > =3D 1453875666968 last_version =3D 3593 cur_worker_name =3D > r12s10.sjc.aristanetworks.com,9104,1453873919762 status =3D in_progress > incarnation =3D 2 resubmits =3D 2 batch =3D installed =3D 1 done =3D 0 er= ror =3D 0, > /hbase/splitWAL/WALs%2Fr12s13.sjc.aristanetworks.com > %2C9104%2C1452811287287-splitting%2Fr12s13.sjc.aristanetworks.com%252C910= 4%252C1452811287287.default.1453728621698=3Dlast_update > =3D 1453875666969 last_version =3D 3650 cur_worker_name =3D > r12s15.sjc.aristanetworks.com,9104,1453873919845 status =3D in_progress > incarnation =3D 2 resubmits =3D 2 batch =3D null, /hbase/splitWAL/WALs% > 2Fr12s4.sjc.aristanetworks.com%2C9104%2C1452811289547-splitting% > 2Fr12s4.sjc.aristanetworks.com%252C9104%252C1452811289547.default.1453728= 949397=3Dlast_update > =3D 1453875666969 last_version =3D 3658 cur_worker_name =3D > r12s9.sjc.aristanetworks.com,9104,1453873919799 status =3D in_progress > incarnation =3D 1 resubmits =3D 1 batch =3D installed =3D 1 done =3D 0 er= ror =3D 0, > /hbase/splitWAL/WALs%2Fr12s5.sjc.aristanetworks.com > %2C9104%2C1452811125084-splitting%2Fr12s5.sjc.aristanetworks.com%252C9104= %252C1452811125084.default.1453728624262=3Dlast_update > =3D 1453875666968 last_version =3D 3437 cur_worker_name =3D > r12s1.sjc.aristanetworks.com,9104,1453873919645 status =3D in_progress > incarnation =3D 2 resubmits =3D 2 batch =3D null, /hbase/splitWAL/WALs% > 2Fr12s1.sjc.aristanetworks.com%2C9104%2C1452811288618-splitting% > 2Fr12s1.sjc.aristanetworks.com%252C9104%252C1452811288618.default.1453728= 791276=3Dlast_update > =3D 1453875666968 last_version =3D 2280 cur_worker_name =3D > r12s14.sjc.aristanetworks.com,9104,1453873919711 status =3D in_progress > incarnation =3D 2 resubmits =3D 1 batch =3D null, /hbase/splitWAL/WALs% > 2Fr12s2.sjc.aristanetworks.com%2C9104%2C1452811286448-splitting% > 2Fr12s2.sjc.aristanetworks.com%252C9104%252C1452811286448.default.1453728= 137282=3Dlast_update > =3D 1453875666968 last_version =3D 2219 cur_worker_name =3D > r12s4.sjc.aristanetworks.com,9104,1453873919688 status =3D in_progress > incarnation =3D 2 resubmits =3D 1 batch =3D null} > > On Tue, Jan 26, 2016 at 10:02 PM, tsuna wrote: > >> Hi, >> after a planned power outage one of our HBase clusters isn=E2=80=99t com= ing back >> up healthy. The master shows the 16 region servers but zero regions. A= ll >> the RegionServers are experiencing the same problem, which is that they= =E2=80=99re >> getting a BadVersion error from ZooKeeper. This was with HBase 1.1.2 >> and I just upgraded all the nodes to 1.1.3 to see if this would make a >> difference, but it didn=E2=80=99t. >> >> 2016-01-27 05:54:02,402 WARN [RS_LOG_REPLAY_OPS-r12s4:9104-0] >> coordination.ZkSplitLogWorkerCoordination: BADVERSION failed to assert >> ownership for /hbase/splitWAL/WALs%2Fr12s16.sjc.aristanetworks.com >> %2C9104%2C1452811286456-splitting%2Fr12s16.sjc.aristanetworks.com >> %252C9104%252C1452811286456.default.1453728374800 >> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCod= e >> =3D BadVersion for /hbase/splitWAL/WALs%2Fr12s16.sjc.aristanetworks.com >> %2C9104%2C1452811286456-splitting%2Fr12s16.sjc.aristanetworks.com >> %252C9104%252C1452811286456.default.1453728374800 >> at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) >> at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) >> at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1270) >> at >> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(Recoverab= leZooKeeper.java:429) >> at >> org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination.attemp= tToOwnTask(ZkSplitLogWorkerCoordination.java:370) >> at >> org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$1.prog= ress(ZkSplitLogWorkerCoordination.java:304) >> at >> org.apache.hadoop.hbase.util.FSHDFSUtils.checkIfCancelled(FSHDFSUtils.ja= va:329) >> at >> org.apache.hadoop.hbase.util.FSHDFSUtils.recoverDFSFileLease(FSHDFSUtils= .java:244) >> at >> org.apache.hadoop.hbase.util.FSHDFSUtils.recoverFileLease(FSHDFSUtils.ja= va:162) >> at org.apache.hadoop.hbase.wal.WALSplitter.getReader(WALSplitter.java:76= 1) >> at >> org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:29= 7) >> at >> org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:23= 5) >> at >> org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorke= r.java:104) >> at >> org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(= WALSplitterHandler.java:72) >> at >> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129) >> at >> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.jav= a:1142) >> at >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.ja= va:617) >> at java.lang.Thread.run(Thread.java:745) >> 2016-01-27 05:54:02,404 WARN [RS_LOG_REPLAY_OPS-r12s4:9104-0] >> coordination.ZkSplitLogWorkerCoordination: Failed to heartbeat the >> task/hbase/splitWAL/WALs%2Fr12s16.sjc.aristanetworks.com >> %2C9104%2C1452811286456-splitting%2Fr12s16.sjc.aristanetworks.com >> %252C9104%252C1452811286456.default.1453728374800 >> >> I=E2=80=99m attaching the full log of the RS this was extracted from, wh= ich I >> just restarted on 1.1.3, in case that=E2=80=99s of any help. >> >> I=E2=80=99ve never seen this before and after a bit of digging, I=E2=80= =99m not really >> going anywhere. Any ideas / suggestions? >> > > -- > Benoit "tsuna" Sigoure > --001a11c2f2daceacd9052a4d06de--