accumulo-notifications mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-4060) Transient ZooKeeper connection issues kills FATE Runner threads
Date Thu, 19 Nov 2015 06:19:11 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-4060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012921#comment-15012921 ]

ASF GitHub Bot commented on ACCUMULO-4060:
------------------------------------------

GitHub user joshelser opened a pull request:

    https://github.com/apache/accumulo/pull/52

    ACCUMULO-4060 Run a timer task to restart failed FATE repo runner threads.
    
    If ZK becomes unavailable for some period of time, it's possible that the
    FATE repo runner threads inside of the master will terminate without
    the master itself dying.
    
    An attempt at an implementation to "recover gracefully" when the repo-runner threads die.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/joshelser/accumulo ACCUMULO-4060-reporunner

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/accumulo/pull/52.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #52
    
----
commit 149df86832d4a417f83175852b4b9f785650ccac
Author: Josh Elser <elserj@apache.org>
Date:   2015-11-19T05:35:40Z

    ACCUMULO-4060 Run a timer task to restart failed FATE repo runner threads.
    
    If ZK becomes unavailable for some period of time, it's possible that the
    FATE repo runner threads inside of the master will terminate without
    the master itself dying.

----


> Transient ZooKeeper connection issues kills FATE Runner threads
> ---------------------------------------------------------------
>
>                 Key: ACCUMULO-4060
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4060
>             Project: Accumulo
>          Issue Type: Bug
>          Components: fate, master
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>             Fix For: 1.7.1, 1.8.0
>
>
> Noticed the following on a 6-node Accumulo cluster with Kerberos and quality of
protection set to auth-conf (wire encryption). The cluster appeared to be up and running and
healthy. An attempt to create a table via the shell hung in the CreateTableCommand, polling
on the FATE operation. After a few minutes, it had made no progress.
> Inspecting the FATE transactions showed that there were (multiple) FATE ops running,
but none were locked or locking any tables, nor making any progress.
> This led me to inspect the Master's log to figure out why it wasn't making any progress,
and, to my joy, I found the following:
> {noformat}
> 2015-11-18 23:18:30,784 [fate.Fate] ERROR: Thread "Repo runner 0" died org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
> java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
>         at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
>         at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode
= ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
>         at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
>         ... 6 more
> 2015-11-18 23:18:30,783 [fate.Fate] ERROR: Thread "Repo runner 2" died org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
> java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
>         at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
>         at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode
= ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
>         at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
>         ... 6 more
> 2015-11-18 23:18:30,787 [fate.Fate] ERROR: Thread "Repo runner 1" died org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
> java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
>         at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
>         at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode
= ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
>         at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
>         ... 6 more
> 2015-11-18 23:18:30,787 [fate.Fate] ERROR: Thread "Repo runner 3" died org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
> java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
>         at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
>         at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode
= ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
>         at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
>         ... 6 more
> {noformat}
> This happened at the end of a ~30s period of difficulties in the Master communicating
with ZooKeeper. I've yet to investigate why this pause happened, but the fact that the FATE
runner threads died and the Master kept running is no good.
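
The recovery idea named in the pull request title (a timer task that restarts failed runner threads) can be sketched roughly as below. This is not the code from the pull request and not Accumulo's Fate class: RunnerWatchdog, ensureRunning, aliveCount, awaitAliveCount and schedule are hypothetical names chosen for illustration, and the sketch only assumes that each runner is a plain Thread executing a Runnable that may die with an unchecked exception.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not Accumulo code: keeps a fixed set of worker threads alive.
class RunnerWatchdog {
  private final Thread[] runners;
  private final Runnable work;

  RunnerWatchdog(int numRunners, Runnable work) {
    this.runners = new Thread[numRunners];
    this.work = work;
  }

  // Starts a fresh thread for every slot that is empty or whose thread has
  // terminated. Returns the number of threads started by this call.
  synchronized int ensureRunning() {
    int started = 0;
    for (int i = 0; i < runners.length; i++) {
      if (runners[i] == null || !runners[i].isAlive()) {
        Thread t = new Thread(work, "Repo runner " + i);
        t.setDaemon(true);
        // Report the death in one line rather than via the default handler.
        t.setUncaughtExceptionHandler((thread, e) ->
            System.err.println("Thread \"" + thread.getName() + "\" died: " + e));
        t.start();
        runners[i] = t;
        started++;
      }
    }
    return started;
  }

  // Number of runner threads that are currently alive.
  synchronized int aliveCount() {
    int alive = 0;
    for (Thread t : runners) {
      if (t != null && t.isAlive()) {
        alive++;
      }
    }
    return alive;
  }

  // Polls until aliveCount() equals expected; returns false on timeout or interrupt.
  boolean awaitAliveCount(int expected, long timeoutMillis) {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    while (aliveCount() != expected) {
      if (System.nanoTime() - deadline >= 0) {
        return false;
      }
      try {
        Thread.sleep(10);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
    return true;
  }

  // Runs ensureRunning() periodically on a single daemon timer thread.
  ScheduledExecutorService schedule(long periodMillis) {
    ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor(r -> {
      Thread t = new Thread(r, "runner-watchdog");
      t.setDaemon(true);
      return t;
    });
    timer.scheduleWithFixedDelay(this::ensureRunning, 0, periodMillis, TimeUnit.MILLISECONDS);
    return timer;
  }
}
```

A caller would construct the watchdog with the runner body and call schedule with a period of its choosing; the period is a tuning assumption, not a value taken from the issue. A different approach would be to catch the connection-loss exception inside the runner loop and retry there, so the thread never dies in the first place; the sketch above only illustrates the restart-on-a-timer variant.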



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
