accumulo-notifications mailing list archives

From "Josh Elser (JIRA)" <j...@apache.org>
Subject [jira] [Created] (ACCUMULO-4060) Transient ZooKeeper connection issues kills FATE Runner threads
Date Thu, 19 Nov 2015 03:31:10 GMT
Josh Elser created ACCUMULO-4060:
------------------------------------

             Summary: Transient ZooKeeper connection issues kills FATE Runner threads
                 Key: ACCUMULO-4060
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4060
             Project: Accumulo
          Issue Type: Bug
          Components: fate, master
            Reporter: Josh Elser
            Assignee: Josh Elser
             Fix For: 1.7.1, 1.8.0


Noticed the following on a 6-node Accumulo cluster with Kerberos and quality of protection
set to auth-conf (wire encryption). The cluster appeared to be up and running -- healthy.
An attempt to create a table via the shell hung in the CreateTableCommand, polling on the
FATE operation. After a few minutes, it had still made no progress.

Inspecting the FATE transactions showed that there were multiple FATE ops running, but none
were locked or locking any tables, nor making any progress.

This led me to inspect the Master's log to figure out why it wasn't making any progress,
and, to my joy, I found the following:

{noformat}
2015-11-18 23:18:30,784 [fate.Fate] ERROR: Thread "Repo runner 0" died org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
        at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
        at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
        at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode =
ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
        at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
        at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
        at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
        ... 6 more
2015-11-18 23:18:30,783 [fate.Fate] ERROR: Thread "Repo runner 2" died org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
        at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
        at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
        at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode =
ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
        at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
        at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
        at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
        ... 6 more
2015-11-18 23:18:30,787 [fate.Fate] ERROR: Thread "Repo runner 1" died org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
        at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
        at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
        at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode =
ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
        at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
        at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
        at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
        ... 6 more
2015-11-18 23:18:30,787 [fate.Fate] ERROR: Thread "Repo runner 3" died org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
        at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
        at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
        at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode =
ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
        at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
        at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
        at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
        ... 6 more
{noformat}

This happened at the end of a ~30s period of difficulties in the Master communicating with
ZooKeeper. I've yet to investigate why this pause happened, but the fact that the FATE runner
threads died and the Master kept running is no good.
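
To illustrate the failure mode, here is a minimal sketch of one possible mitigation: retrying
a store operation on a transient error with backoff, so the calling thread does not die on the
first failure. It is not the fix applied to Accumulo. The class FateRunnerSketch, the exception
TransientStoreException (a stand-in for ZooKeeper's ConnectionLossException), the method
retryTransient and the backoff values are all hypothetical names chosen for this example.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

class FateRunnerSketch {

  /** Hypothetical stand-in for a transient failure such as ZooKeeper's ConnectionLossException. */
  static class TransientStoreException extends Exception {
    TransientStoreException(String msg) {
      super(msg);
    }
  }

  /**
   * Calls op until it succeeds or maxAttempts is reached. Only the transient
   * exception is retried; any other exception propagates immediately.
   */
  static <T> T retryTransient(Callable<T> op, int maxAttempts, long initialBackoffMs)
      throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    long backoffMs = initialBackoffMs;
    for (int attempt = 1;; attempt++) {
      try {
        return op.call();
      } catch (TransientStoreException e) {
        if (attempt >= maxAttempts) {
          throw e; // give up and let the caller decide what to do
        }
        Thread.sleep(backoffMs);
        backoffMs = Math.min(backoffMs * 2, 5000L); // capped exponential backoff (assumed values)
      }
    }
  }

  public static void main(String[] args) throws Exception {
    AtomicInteger calls = new AtomicInteger();
    // Simulated reserve(): fails twice with a transient error, then succeeds.
    String txid = retryTransient(() -> {
      if (calls.incrementAndGet() < 3) {
        throw new TransientStoreException("ConnectionLoss (simulated)");
      }
      return "tx-1";
    }, 5, 10L);
    System.out.println("reserved " + txid + " after " + calls.get() + " attempts");
  }
}
```

Whether to retry indefinitely, give up after a bounded number of attempts, or halt the whole
process when the runners cannot recover is a design choice the sketch leaves open.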



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
