accumulo-notifications mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Josh Elser (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (ACCUMULO-3336) ZK session reconnect still results in loss of ZK lock
Date Wed, 17 Dec 2014 17:46:14 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14250189#comment-14250189
] 

Josh Elser edited comment on ACCUMULO-3336 at 12/17/14 5:45 PM:
----------------------------------------------------------------

[~tristeng], thanks for commenting. It sounds like you're getting a bit mixed up in terminology.

The sections from the user manual that you're referring to are for JVM garbage collection
which is a common pain-point for losing ZK locks.

The property "gc.cycle.delay" controls how often the Accumulo garbage collector process is
running. The Accumulo garbage collector is just cleaning up assorted files on HDFS that are
no longer needed. The only way that the garbage collector process is going to interfere with
TabletServer JVM GC is if there is shared contention through the namenode (which doesn't sound
like what you're running into).


was (Author: elserj):
[~tristeng], thanks for commenting. It sounds like you're getting a bit mixed up in terminology.

The sections from the user manual that you're referring to are for JVM garbage collection
which is a common pain-point for losing ZK locks.

The property "gc.cycle.delay" controls how often the Accumulo garbage collector process is
running. The Accumulo garbage collector is just cleaning up assorted files on HDFS that are
no longer needed.

> ZK session reconnect still results in loss of ZK lock
> -----------------------------------------------------
>
>                 Key: ACCUMULO-3336
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3336
>             Project: Accumulo
>          Issue Type: Bug
>          Components: zookeeper
>    Affects Versions: 1.5.2, 1.6.1
>            Reporter: Josh Elser
>             Fix For: 1.7.0
>
>
> Saw the following
> {noformat}
> 2014-11-14 08:38:30,612 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper
event: None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/bulk_failed_copyq
> 2014-11-14 08:38:30,621 [zookeeper.ZooCache] WARN : Saw (possibly) transient exception
communicating with ZooKeeper, will retry
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/config/tserver.compaction.warn.time
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> 	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:260)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:157)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:285)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:232)
> 	at org.apache.accumulo.server.conf.ZooConfiguration.get(ZooConfiguration.java:96)
> 	at org.apache.accumulo.server.conf.ZooConfiguration._get(ZooConfiguration.java:65)
> 	at org.apache.accumulo.server.conf.ZooConfiguration.get(ZooConfiguration.java:90)
> 	at org.apache.accumulo.core.conf.AccumuloConfiguration.getTimeInMillis(AccumuloConfiguration.java:136)
> 	at org.apache.accumulo.tserver.CompactionWatcher.run(CompactionWatcher.java:84)
> 	at org.apache.accumulo.server.util.time.SimpleTimer$LoggingTimerTask.run(SimpleTimer.java:42)
> 	at java.util.TimerThread.mainLoop(Timer.java:555)
> 	at java.util.TimerThread.run(Timer.java:505)
> 2014-11-14 08:38:30,672 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper
event: None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/recovery
> 2014-11-14 08:38:30,672 [zookeeper.ZooLock] DEBUG: event null None Disconnected
> 2014-11-14 08:38:31,484 [zookeeper.ZooReader] WARN : Saw (possibly) transient exception
communicating with ZooKeeper
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session
expired for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/tservers/ip-172-31-13-177:37709
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> 	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
> 	at org.apache.accumulo.fate.zookeeper.ZooReader.getStatus(ZooReader.java:109)
> 	at org.apache.accumulo.fate.zookeeper.ZooLock.process(ZooLock.java:381)
> 	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
> 	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> 2014-11-14 08:38:31,484 [zookeeper.ZooCache] WARN : Saw (possibly) transient exception
communicating with ZooKeeper, will retry
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session
expired for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/tables/!0/namespace
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> 	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:260)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:157)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:285)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:232)
> 	at org.apache.accumulo.core.client.impl.Tables.getNamespaceId(Tables.java:304)
> 	at org.apache.accumulo.server.conf.TableParentConfiguration.getNamespaceId(TableParentConfiguration.java:47)
> 	at org.apache.accumulo.server.conf.NamespaceConfiguration.getPath(NamespaceConfiguration.java:85)
> 	at org.apache.accumulo.server.conf.NamespaceConfiguration.get(NamespaceConfiguration.java:98)
> 	at org.apache.accumulo.server.conf.ZooCachePropertyAccessor.get(ZooCachePropertyAccessor.java:107)
> 	at org.apache.accumulo.server.conf.TableConfiguration.get(TableConfiguration.java:103)
> 	at org.apache.accumulo.core.conf.AccumuloConfiguration.getCount(AccumuloConfiguration.java:193)
> 	at org.apache.accumulo.tserver.TabletServer$MajorCompactor.run(TabletServer.java:2636)
> 	at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
> 	at java.lang.Thread.run(Thread.java:745)
> 2014-11-14 08:38:31,484 [zookeeper.Retry] DEBUG: Sleeping for 250ms before retrying operation
> 2014-11-14 08:38:31,485 [zookeeper.ZooSession] DEBUG: Removing closed ZooKeeper session
to localhost:12644
> 2014-11-14 08:38:31,485 [zookeeper.ZooSession] DEBUG: Connecting to localhost:12644 with
timeout 30000 with auth
> 2014-11-14 08:38:31,588 [zookeeper.ZooSession] DEBUG: Removing closed ZooKeeper session
to localhost:12644
> 2014-11-14 08:38:31,588 [zookeeper.ZooSession] DEBUG: Connecting to localhost:12644 with
timeout 30000 with auth
> 2014-11-14 08:38:31,692 [tserver.TabletServer] DEBUG: gc ParNew=0.10(+0.04) secs ConcurrentMarkSweep=0.05(+0.00)
secs freemem=118,013,904(+6,412,200) totalmem=129,761,280
> 2014-11-14 08:38:31,692 [tserver.TabletServer] WARN : GC pause checker not called in
a timely fashion. Expected every 5.0 seconds but was 43.1 seconds since last check
> 2014-11-14 08:38:31,700 [zookeeper.DistributedWorkQueue] DEBUG: Looking for work in /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/bulk_failed_copyq
> 2014-11-14 08:38:31,701 [zookeeper.DistributedWorkQueue] DEBUG: Looking for work in /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/recovery
> 2014-11-14 08:38:31,715 [tserver.TabletServer] DEBUG: ScanSess tid 172.31.13.177:35935
!0 1 entries in 0.03 secs, nbTimes = [24 24 24.00 1] 
> 2014-11-14 08:38:31,737 [trace.ZooTraceClient] DEBUG: Scanning trace hosts in zookeeper:
/accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/tracers
> 2014-11-14 08:38:31,737 [trace.ZooTraceClient] DEBUG: Trace hosts: []
> 2014-11-14 08:38:31,739 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper
event: None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/replication/workqueue
> 2014-11-14 08:38:31,739 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper
event: None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/bulk_failed_copyq
> 2014-11-14 08:38:31,739 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper
event: None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/recovery
> 2014-11-14 08:38:31,739 [zookeeper.ZooSession] DEBUG: Session expired, state of current
session : Expired
> 2014-11-14 08:38:31,739 [zookeeper.ZooLock] DEBUG: event null None Expired
> 2014-11-14 08:38:31,741 [tserver.TabletServer] FATAL: Lost tablet server lock (reason
= SESSION_EXPIRED), exiting.
> {noformat}
> ZooKeeper code appears to had disconnected, closed the disconnected connection and then
opened a new session. However, the ZooLock, IIRC, didn't reconnect and hung the tserver.
> If we want to support this, it might require rehashing some of the ZooLock code (to prevent
the tserver from processing while the tserver doesn't have its lock).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message