Reply-To: jira@apache.org
Date: Thu, 19 Nov 2015 23:20:11 +0000 (UTC)
From: "ASF GitHub Bot (JIRA)"
To: notifications@accumulo.apache.org
Subject: [jira] [Commented] (ACCUMULO-4060) Transient ZooKeeper connection issues kills FATE Runner threads

    [ https://issues.apache.org/jira/browse/ACCUMULO-4060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15014720#comment-15014720 ]

ASF GitHub Bot commented on ACCUMULO-4060:
------------------------------------------

Github user keith-turner commented on the pull request:

    https://github.com/apache/accumulo/pull/52#issuecomment-158231342

    > eating those exceptions don't just eat them, log them.
    > I suppose in the end it's no different.
    Yeah I can't see any differences between the approaches.

> Transient ZooKeeper connection issues kills FATE Runner threads
> ---------------------------------------------------------------
>
>                 Key: ACCUMULO-4060
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4060
>             Project: Accumulo
>          Issue Type: Bug
>          Components: fate, master
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>             Fix For: 1.7.1, 1.8.0
>
>
> Noticed the following on a 6 node Accumulo cluster with Kerberos and quality of protection set to auth-conf (wire encryption). The cluster appeared to be up and running -- healthy. An attempt to create a table via the shell hung in the CreateTableCommand, polling on the FATE operation. After a few minutes, it had made no progress.
> Inspecting the FATE transactions showed that there were (multiple) FATE ops running, but none were locked or locking any tables, nor making any progress.
> This led me to inspect the Master's log to figure out why it wasn't making any progress, and, to my joy, I found the following:
> {noformat}
> 2015-11-18 23:18:30,784 [fate.Fate] ERROR: Thread "Repo runner 0" died org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
> java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>     at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
>     at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
>     at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
>     at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
>     at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
>     at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
>     ... 6 more
> 2015-11-18 23:18:30,783 [fate.Fate] ERROR: Thread "Repo runner 2" died org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
> java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>     at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
>     at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
>     at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
>     at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
>     at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
>     at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
>     ... 6 more
> 2015-11-18 23:18:30,787 [fate.Fate] ERROR: Thread "Repo runner 1" died org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
> java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>     at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
>     at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
>     at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
>     at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
>     at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
>     at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
>     ... 6 more
> 2015-11-18 23:18:30,787 [fate.Fate] ERROR: Thread "Repo runner 3" died org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
> java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>     at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
>     at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
>     at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
>     at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
>     at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
>     at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
>     ... 6 more
> {noformat}
> This happened at the end of a ~30s period of difficulties in the Master communicating with ZooKeeper. I've yet to investigate why this pause happened, but the fact that the FATE runner threads died and the Master kept running is no good.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
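
For readers of the archive, here is a rough sketch of the "log them, don't just eat them" idea from the comment above. It is not the change made in the pull request and it uses no Accumulo or ZooKeeper classes: `ResilientRunner`, `TransientStoreException` and `callWithRetry` are hypothetical names invented for illustration, with `TransientStoreException` standing in for a wrapped ConnectionLoss. The assumption is simply that a transient failure should be logged and retried with a short pause instead of propagating out of the runner and killing its thread.

```java
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper, not part of Accumulo: retries an operation that may
// fail transiently, logging each failure instead of silently swallowing it.
final class ResilientRunner {
  private static final Logger LOG = Logger.getLogger(ResilientRunner.class.getName());

  /** Hypothetical stand-in for a transient store failure such as ConnectionLoss. */
  static final class TransientStoreException extends RuntimeException {
    TransientStoreException(String message) {
      super(message);
    }
  }

  /**
   * Runs op until it succeeds or maxAttempts is reached. Every transient
   * failure is logged; the last one is rethrown so the caller can decide
   * what to do rather than the failure being lost.
   */
  static <T> T callWithRetry(Supplier<T> op, int maxAttempts, long backoffMillis) {
    for (int attempt = 1;; attempt++) {
      try {
        return op.get();
      } catch (TransientStoreException e) {
        if (attempt >= maxAttempts) {
          throw e; // out of attempts: surface the failure to the caller
        }
        LOG.log(Level.WARNING, "Transient failure on attempt " + attempt + ", retrying", e);
        try {
          Thread.sleep(backoffMillis);
        } catch (InterruptedException ie) {
          // Preserve the interrupt status and stop retrying.
          Thread.currentThread().interrupt();
          throw new IllegalStateException("Interrupted while backing off", ie);
        }
      }
    }
  }
}
```

A runner loop could wrap its reserve step in `callWithRetry` and additionally catch and log whatever is rethrown after the last attempt, so that a ~30s outage costs some retries and log lines rather than the thread itself. The attempt count and backoff are placeholders to tune.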