Date: Thu, 19 Nov 2015 06:19:11 +0000 (UTC)
From: "ASF GitHub Bot (JIRA)"
To: notifications@accumulo.apache.org
Reply-To: jira@apache.org
Subject: [jira] [Commented] (ACCUMULO-4060) Transient ZooKeeper connection issues kills FATE Runner threads

    [ https://issues.apache.org/jira/browse/ACCUMULO-4060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012921#comment-15012921 ]

ASF GitHub Bot commented on ACCUMULO-4060:
------------------------------------------

GitHub user joshelser opened a pull request:

    https://github.com/apache/accumulo/pull/52

    ACCUMULO-4060 Run a timer task to restart failed FATE repo runner thr…

    …eads.

    If ZK becomes unavailable for some period of time, it's possible that the
    FATE repo runner threads inside of the master will terminate without the
    master itself dying.

    An attempt at an implementation to "recover gracefully" when the
    repo-runner threads die.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/joshelser/accumulo ACCUMULO-4060-reporunner

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/accumulo/pull/52.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #52

----
commit 149df86832d4a417f83175852b4b9f785650ccac
Author: Josh Elser
Date:   2015-11-19T05:35:40Z

    ACCUMULO-4060 Run a timer task to restart failed FATE repo runner threads.

    If ZK becomes unavailable for some period of time, it's possible that the
    FATE repo runner threads inside of the master will terminate without the
    master itself dying.

----

> Transient ZooKeeper connection issues kills FATE Runner threads
> ---------------------------------------------------------------
>
>                 Key: ACCUMULO-4060
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4060
>             Project: Accumulo
>          Issue Type: Bug
>          Components: fate, master
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>             Fix For: 1.7.1, 1.8.0
>
>
> Noticed the following on a 6 node Accumulo cluster with Kerberos and quality of protection set to auth-conf (wire encryption). The cluster appeared to be up and running -- healthy. An attempt to create a table via the shell hung in the CreateTableCommand, polling on the FATE operation. After a few minutes, it had made no progress.
> Inspecting the FATE transactions showed that there were (multiple) FATE ops running, but none were locked or locking any tables, nor making any progress.
> This led me to inspect the Master's log to figure out why it wasn't making any progress, and, to my joy, I found the following:
> {noformat}
> 2015-11-18 23:18:30,784 [fate.Fate] ERROR: Thread "Repo runner 0" died org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
> java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
>         at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
>         at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
>         at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
>         ... 6 more
> 2015-11-18 23:18:30,783 [fate.Fate] ERROR: Thread "Repo runner 2" died org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
> java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
>         at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
>         at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
>         at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
>         ... 6 more
> 2015-11-18 23:18:30,787 [fate.Fate] ERROR: Thread "Repo runner 1" died org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
> java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
>         at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
>         at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
>         at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
>         ... 6 more
> 2015-11-18 23:18:30,787 [fate.Fate] ERROR: Thread "Repo runner 3" died org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
> java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
>         at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
>         at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
>         at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
>         ... 6 more
> {noformat}
> This happened at the end of a ~30s period of difficulties in the Master communicating with ZooKeeper. I've yet to investigate why this pause happened, but the fact that the FATE runner threads died and the Master kept running is no good.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
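
For illustration only, the general idea named in the pull request title (a timer task that restarts runner threads that have died) might be sketched roughly as below. This is not the code from the pull request or from Accumulo: the class RunnerWatchdog and all of its methods are hypothetical names, and the sketch assumes that a runner which throws an unchecked exception can simply be replaced by a fresh one.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical supervisor (not Accumulo code) that keeps a fixed number of runners alive. */
public class RunnerWatchdog {
  private final int desired;
  private final Runnable work;
  private final AtomicInteger alive = new AtomicInteger();
  private final AtomicInteger started = new AtomicInteger();
  private final ExecutorService workers = Executors.newCachedThreadPool();
  private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

  public RunnerWatchdog(int desired, Runnable work) {
    this.desired = desired;
    this.work = work;
  }

  /** Submits one runner and tracks whether it is still alive. */
  private void startRunner() {
    alive.incrementAndGet();
    started.incrementAndGet();
    workers.execute(() -> {
      try {
        work.run();
      } catch (RuntimeException e) {
        // The runner is gone; log it and let the timer task notice the gap.
        System.err.println("Runner died: " + e);
      } finally {
        alive.decrementAndGet();
      }
    });
  }

  /** Timer task body: start replacements until the desired count is alive again. */
  void topUp() {
    // Only the single timer thread calls this, so the live count never exceeds 'desired'.
    while (alive.get() < desired) {
      startRunner();
    }
  }

  /** Starts the initial runners and re-checks them every periodMillis milliseconds. */
  public void start(long periodMillis) {
    timer.scheduleWithFixedDelay(this::topUp, 0, periodMillis, TimeUnit.MILLISECONDS);
  }

  public int alive() {
    return alive.get();
  }

  public int started() {
    return started.get();
  }

  public void stop() {
    timer.shutdownNow();
    workers.shutdownNow();
  }
}
```

In this sketch a runner that returns normally is replaced as well, which only makes sense if runners are expected to loop forever. An alternative design would catch the connection-loss exception inside the runner loop and retry there, so the thread never dies in the first place.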