accumulo-notifications mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-4060) Transient ZooKeeper connection issues kills FATE Runner threads
Date Thu, 19 Nov 2015 23:01:11 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-4060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15014670#comment-15014670 ]

ASF GitHub Bot commented on ACCUMULO-4060:
------------------------------------------

Github user joshelser commented on the pull request:

    https://github.com/apache/accumulo/pull/52#issuecomment-158227531
  
    > Seems like it would be simpler to modify transaction runner and add a try/catch/log
just inside the while loop.
    
    We could very well do this too. I was hoping to pick your brain about any worries with
just eating those exceptions. I suppose in the end it's no different.
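
    For reference, a minimal sketch of the try/catch/log-inside-the-loop idea. This is not
the actual Fate.java code: the class, the FlakyStore stand-in, and the runLoop method are
hypothetical names made up for illustration. The only things taken from the issue are the
shape of the failure (reserve() throwing a RuntimeException) and the idea of catching and
logging inside the loop so the runner keeps going.

```java
/** Hypothetical sketch; none of these names come from the Accumulo code base. */
class ResilientRunnerSketch {

  /** Stand-in for a store whose reserve() fails a fixed number of times, then works. */
  static final class FlakyStore {
    private int failuresLeft;
    private long nextId = 1;

    FlakyStore(int failures) {
      this.failuresLeft = failures;
    }

    long reserve() {
      if (failuresLeft > 0) {
        failuresLeft--;
        // Simulates a transient connection problem surfacing as an unchecked exception.
        throw new RuntimeException("simulated connection loss");
      }
      return nextId++;
    }
  }

  /**
   * Loops until 'target' reservations have succeeded. Exceptions are caught and
   * logged inside the loop, so one failure does not end the loop (or the thread
   * running it). Returns the number of failures that were swallowed.
   */
  static int runLoop(FlakyStore store, int target) {
    int done = 0;
    int swallowed = 0;
    while (done < target) {
      try {
        store.reserve(); // a real runner would go on to execute the reserved transaction
        done++;
      } catch (RuntimeException e) {
        swallowed++;
        System.err.println("Runner hit an error, continuing: " + e.getMessage());
      }
    }
    return swallowed;
  }

  public static void main(String[] args) {
    int swallowed = runLoop(new FlakyStore(2), 3);
    System.out.println("swallowed=" + swallowed);
  }
}
```

    The sketch retries immediately. A real runner would probably want a pause or backoff
between attempts so it does not spin while ZooKeeper is unreachable, and catching this
broadly also hides errors that are not transient, which is the worry raised above.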


> Transient ZooKeeper connection issues kills FATE Runner threads
> ---------------------------------------------------------------
>
>                 Key: ACCUMULO-4060
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4060
>             Project: Accumulo
>          Issue Type: Bug
>          Components: fate, master
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>             Fix For: 1.7.1, 1.8.0
>
>
> Noticed the following on a 6-node Accumulo cluster with Kerberos and quality of
protection set to auth-conf (wire encryption). The cluster appeared to be up and running --
healthy. An attempt to create a table via the shell hung in the CreateTableCommand, polling
on the FATE operation. After a few minutes, it had still made no progress.
> Inspecting the FATE transactions showed that there were (multiple) FATE ops running,
but none were locked or locking any tables, nor making any progress.
> This led me to inspect the Master's log to figure out why it wasn't making any progress,
and, to my joy, I found the following:
> {noformat}
> 2015-11-18 23:18:30,784 [fate.Fate] ERROR: Thread "Repo runner 0" died org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
> java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
>         at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
>         at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode
= ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
>         at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
>         ... 6 more
> 2015-11-18 23:18:30,783 [fate.Fate] ERROR: Thread "Repo runner 2" died org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
> java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
>         at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
>         at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode
= ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
>         at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
>         ... 6 more
> 2015-11-18 23:18:30,787 [fate.Fate] ERROR: Thread "Repo runner 1" died org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
> java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
>         at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
>         at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode
= ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
>         at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
>         ... 6 more
> 2015-11-18 23:18:30,787 [fate.Fate] ERROR: Thread "Repo runner 3" died org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
> java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
>         at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
>         at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode
= ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
>         at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
>         ... 6 more
> {noformat}
> This happened at the end of a ~30s period of difficulties in the Master communicating
with ZooKeeper. I've yet to investigate why this pause happened, but the fact that the FATE
runner threads died and the Master kept running is no good.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
