accumulo-notifications mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tristen Georgiou (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-3336) ZK session reconnect still results in loss of ZK lock
Date Wed, 17 Dec 2014 20:22:15 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14250466#comment-14250466
] 

Tristen Georgiou commented on ACCUMULO-3336:
--------------------------------------------

I reverted my configurations back to their original, and decided to try moving the Accumulo
GC process (although since this is like a JVM GC problem, I could have moved any of the processes
I imagine) to a different node than the node where the bulk of the Accumulo processes are
running, and encountered the same error after about 35 minutes.

The tablet server that died, is on the same node as where I moved the Accumulo GC process,
but this node is also running a bunch of other Hadoop processes, so I'm not sure if this provides
any useful information. 15 minutes later, the node where the bulk of the Accumulo processes
are running died.

>From your comments, I'm guessing that my nodes are overloaded. Ideally, am I right in
assuming a tablet server would only run a data node and resource manager node? Thanks for
the help so far!

> ZK session reconnect still results in loss of ZK lock
> -----------------------------------------------------
>
>                 Key: ACCUMULO-3336
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3336
>             Project: Accumulo
>          Issue Type: Bug
>          Components: zookeeper
>    Affects Versions: 1.5.2, 1.6.1
>            Reporter: Josh Elser
>             Fix For: 1.7.0
>
>
> Saw the following
> {noformat}
> 2014-11-14 08:38:30,612 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper
event: None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/bulk_failed_copyq
> 2014-11-14 08:38:30,621 [zookeeper.ZooCache] WARN : Saw (possibly) transient exception
communicating with ZooKeeper, will retry
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/config/tserver.compaction.warn.time
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> 	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:260)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:157)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:285)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:232)
> 	at org.apache.accumulo.server.conf.ZooConfiguration.get(ZooConfiguration.java:96)
> 	at org.apache.accumulo.server.conf.ZooConfiguration._get(ZooConfiguration.java:65)
> 	at org.apache.accumulo.server.conf.ZooConfiguration.get(ZooConfiguration.java:90)
> 	at org.apache.accumulo.core.conf.AccumuloConfiguration.getTimeInMillis(AccumuloConfiguration.java:136)
> 	at org.apache.accumulo.tserver.CompactionWatcher.run(CompactionWatcher.java:84)
> 	at org.apache.accumulo.server.util.time.SimpleTimer$LoggingTimerTask.run(SimpleTimer.java:42)
> 	at java.util.TimerThread.mainLoop(Timer.java:555)
> 	at java.util.TimerThread.run(Timer.java:505)
> 2014-11-14 08:38:30,672 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper
event: None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/recovery
> 2014-11-14 08:38:30,672 [zookeeper.ZooLock] DEBUG: event null None Disconnected
> 2014-11-14 08:38:31,484 [zookeeper.ZooReader] WARN : Saw (possibly) transient exception
communicating with ZooKeeper
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session
expired for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/tservers/ip-172-31-13-177:37709
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> 	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
> 	at org.apache.accumulo.fate.zookeeper.ZooReader.getStatus(ZooReader.java:109)
> 	at org.apache.accumulo.fate.zookeeper.ZooLock.process(ZooLock.java:381)
> 	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
> 	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> 2014-11-14 08:38:31,484 [zookeeper.ZooCache] WARN : Saw (possibly) transient exception
communicating with ZooKeeper, will retry
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session
expired for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/tables/!0/namespace
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> 	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:260)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:157)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:285)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:232)
> 	at org.apache.accumulo.core.client.impl.Tables.getNamespaceId(Tables.java:304)
> 	at org.apache.accumulo.server.conf.TableParentConfiguration.getNamespaceId(TableParentConfiguration.java:47)
> 	at org.apache.accumulo.server.conf.NamespaceConfiguration.getPath(NamespaceConfiguration.java:85)
> 	at org.apache.accumulo.server.conf.NamespaceConfiguration.get(NamespaceConfiguration.java:98)
> 	at org.apache.accumulo.server.conf.ZooCachePropertyAccessor.get(ZooCachePropertyAccessor.java:107)
> 	at org.apache.accumulo.server.conf.TableConfiguration.get(TableConfiguration.java:103)
> 	at org.apache.accumulo.core.conf.AccumuloConfiguration.getCount(AccumuloConfiguration.java:193)
> 	at org.apache.accumulo.tserver.TabletServer$MajorCompactor.run(TabletServer.java:2636)
> 	at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
> 	at java.lang.Thread.run(Thread.java:745)
> 2014-11-14 08:38:31,484 [zookeeper.Retry] DEBUG: Sleeping for 250ms before retrying operation
> 2014-11-14 08:38:31,485 [zookeeper.ZooSession] DEBUG: Removing closed ZooKeeper session
to localhost:12644
> 2014-11-14 08:38:31,485 [zookeeper.ZooSession] DEBUG: Connecting to localhost:12644 with
timeout 30000 with auth
> 2014-11-14 08:38:31,588 [zookeeper.ZooSession] DEBUG: Removing closed ZooKeeper session
to localhost:12644
> 2014-11-14 08:38:31,588 [zookeeper.ZooSession] DEBUG: Connecting to localhost:12644 with
timeout 30000 with auth
> 2014-11-14 08:38:31,692 [tserver.TabletServer] DEBUG: gc ParNew=0.10(+0.04) secs ConcurrentMarkSweep=0.05(+0.00)
secs freemem=118,013,904(+6,412,200) totalmem=129,761,280
> 2014-11-14 08:38:31,692 [tserver.TabletServer] WARN : GC pause checker not called in
a timely fashion. Expected every 5.0 seconds but was 43.1 seconds since last check
> 2014-11-14 08:38:31,700 [zookeeper.DistributedWorkQueue] DEBUG: Looking for work in /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/bulk_failed_copyq
> 2014-11-14 08:38:31,701 [zookeeper.DistributedWorkQueue] DEBUG: Looking for work in /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/recovery
> 2014-11-14 08:38:31,715 [tserver.TabletServer] DEBUG: ScanSess tid 172.31.13.177:35935
!0 1 entries in 0.03 secs, nbTimes = [24 24 24.00 1] 
> 2014-11-14 08:38:31,737 [trace.ZooTraceClient] DEBUG: Scanning trace hosts in zookeeper:
/accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/tracers
> 2014-11-14 08:38:31,737 [trace.ZooTraceClient] DEBUG: Trace hosts: []
> 2014-11-14 08:38:31,739 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper
event: None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/replication/workqueue
> 2014-11-14 08:38:31,739 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper
event: None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/bulk_failed_copyq
> 2014-11-14 08:38:31,739 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper
event: None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/recovery
> 2014-11-14 08:38:31,739 [zookeeper.ZooSession] DEBUG: Session expired, state of current
session : Expired
> 2014-11-14 08:38:31,739 [zookeeper.ZooLock] DEBUG: event null None Expired
> 2014-11-14 08:38:31,741 [tserver.TabletServer] FATAL: Lost tablet server lock (reason
= SESSION_EXPIRED), exiting.
> {noformat}
> ZooKeeper code appears to had disconnected, closed the disconnected connection and then
opened a new session. However, the ZooLock, IIRC, didn't reconnect and hung the tserver.
> If we want to support this, it might require rehashing some of the ZooLock code (to prevent
the tserver from processing while the tserver doesn't have its lock).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message