accumulo-notifications mailing list archives

From "Josh Elser (JIRA)" <j...@apache.org>
Subject [jira] [Created] (ACCUMULO-3336) ZK session reconnect still results in loss of ZK lock
Date Fri, 14 Nov 2014 16:16:34 GMT
Josh Elser created ACCUMULO-3336:
------------------------------------

             Summary: ZK session reconnect still results in loss of ZK lock
                 Key: ACCUMULO-3336
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3336
             Project: Accumulo
          Issue Type: Bug
          Components: zookeeper
    Affects Versions: 1.6.1, 1.5.2
            Reporter: Josh Elser
             Fix For: 1.7.0


Saw the following in a tserver log:

{noformat}
2014-11-14 08:38:30,612 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper event:
None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/bulk_failed_copyq
2014-11-14 08:38:30,621 [zookeeper.ZooCache] WARN : Saw (possibly) transient exception communicating
with ZooKeeper, will retry
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/config/tserver.compaction.warn.time
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
	at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:260)
	at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:157)
	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:285)
	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:232)
	at org.apache.accumulo.server.conf.ZooConfiguration.get(ZooConfiguration.java:96)
	at org.apache.accumulo.server.conf.ZooConfiguration._get(ZooConfiguration.java:65)
	at org.apache.accumulo.server.conf.ZooConfiguration.get(ZooConfiguration.java:90)
	at org.apache.accumulo.core.conf.AccumuloConfiguration.getTimeInMillis(AccumuloConfiguration.java:136)
	at org.apache.accumulo.tserver.CompactionWatcher.run(CompactionWatcher.java:84)
	at org.apache.accumulo.server.util.time.SimpleTimer$LoggingTimerTask.run(SimpleTimer.java:42)
	at java.util.TimerThread.mainLoop(Timer.java:555)
	at java.util.TimerThread.run(Timer.java:505)
2014-11-14 08:38:30,672 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper event:
None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/recovery
2014-11-14 08:38:30,672 [zookeeper.ZooLock] DEBUG: event null None Disconnected
2014-11-14 08:38:31,484 [zookeeper.ZooReader] WARN : Saw (possibly) transient exception communicating
with ZooKeeper
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/tservers/ip-172-31-13-177:37709
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
	at org.apache.accumulo.fate.zookeeper.ZooReader.getStatus(ZooReader.java:109)
	at org.apache.accumulo.fate.zookeeper.ZooLock.process(ZooLock.java:381)
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2014-11-14 08:38:31,484 [zookeeper.ZooCache] WARN : Saw (possibly) transient exception communicating
with ZooKeeper, will retry
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/tables/!0/namespace
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
	at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:260)
	at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:157)
	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:285)
	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:232)
	at org.apache.accumulo.core.client.impl.Tables.getNamespaceId(Tables.java:304)
	at org.apache.accumulo.server.conf.TableParentConfiguration.getNamespaceId(TableParentConfiguration.java:47)
	at org.apache.accumulo.server.conf.NamespaceConfiguration.getPath(NamespaceConfiguration.java:85)
	at org.apache.accumulo.server.conf.NamespaceConfiguration.get(NamespaceConfiguration.java:98)
	at org.apache.accumulo.server.conf.ZooCachePropertyAccessor.get(ZooCachePropertyAccessor.java:107)
	at org.apache.accumulo.server.conf.TableConfiguration.get(TableConfiguration.java:103)
	at org.apache.accumulo.core.conf.AccumuloConfiguration.getCount(AccumuloConfiguration.java:193)
	at org.apache.accumulo.tserver.TabletServer$MajorCompactor.run(TabletServer.java:2636)
	at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
	at java.lang.Thread.run(Thread.java:745)
2014-11-14 08:38:31,484 [zookeeper.Retry] DEBUG: Sleeping for 250ms before retrying operation
2014-11-14 08:38:31,485 [zookeeper.ZooSession] DEBUG: Removing closed ZooKeeper session to
localhost:12644
2014-11-14 08:38:31,485 [zookeeper.ZooSession] DEBUG: Connecting to localhost:12644 with timeout
30000 with auth
2014-11-14 08:38:31,588 [zookeeper.ZooSession] DEBUG: Removing closed ZooKeeper session to
localhost:12644
2014-11-14 08:38:31,588 [zookeeper.ZooSession] DEBUG: Connecting to localhost:12644 with timeout
30000 with auth
2014-11-14 08:38:31,692 [tserver.TabletServer] DEBUG: gc ParNew=0.10(+0.04) secs ConcurrentMarkSweep=0.05(+0.00)
secs freemem=118,013,904(+6,412,200) totalmem=129,761,280
2014-11-14 08:38:31,692 [tserver.TabletServer] WARN : GC pause checker not called in a timely
fashion. Expected every 5.0 seconds but was 43.1 seconds since last check
2014-11-14 08:38:31,700 [zookeeper.DistributedWorkQueue] DEBUG: Looking for work in /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/bulk_failed_copyq
2014-11-14 08:38:31,701 [zookeeper.DistributedWorkQueue] DEBUG: Looking for work in /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/recovery
2014-11-14 08:38:31,715 [tserver.TabletServer] DEBUG: ScanSess tid 172.31.13.177:35935 !0
1 entries in 0.03 secs, nbTimes = [24 24 24.00 1] 
2014-11-14 08:38:31,737 [trace.ZooTraceClient] DEBUG: Scanning trace hosts in zookeeper: /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/tracers
2014-11-14 08:38:31,737 [trace.ZooTraceClient] DEBUG: Trace hosts: []
2014-11-14 08:38:31,739 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper event:
None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/replication/workqueue
2014-11-14 08:38:31,739 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper event:
None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/bulk_failed_copyq
2014-11-14 08:38:31,739 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper event:
None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/recovery
2014-11-14 08:38:31,739 [zookeeper.ZooSession] DEBUG: Session expired, state of current session
: Expired
2014-11-14 08:38:31,739 [zookeeper.ZooLock] DEBUG: event null None Expired
2014-11-14 08:38:31,741 [tserver.TabletServer] FATAL: Lost tablet server lock (reason = SESSION_EXPIRED),
exiting.
{noformat}

The ZooKeeper code appears to have disconnected, closed the disconnected connection, and then
opened a new session. However, the ZooLock, IIRC, didn't reconnect and hung the tserver.

If we want to support this, it might require reworking some of the ZooLock code (to prevent
the tserver from processing while it doesn't hold its lock).
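
To make the idea of "not processing while the lock isn't held" concrete, here is a minimal sketch in Java. It is not Accumulo's ZooLock code and does not propose a specific fix: the class `LockGate` and all of its method names are hypothetical, and it uses only `java.util.concurrent`. It relies on standard ZooKeeper behavior: a Disconnected event leaves the session (and its ephemeral lock node) possibly still valid, while an Expired event means the ephemeral node is gone and the lock would have to be reacquired in a new session.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical gate (not part of Accumulo) tracking whether the server may do work. */
final class LockGate {
  enum State { HELD, SUSPENDED, LOST }

  private final AtomicReference<State> state = new AtomicReference<>(State.HELD);

  /** Connection dropped: lock status is unknown, so stop processing. */
  void onDisconnected() {
    state.compareAndSet(State.HELD, State.SUSPENDED);
  }

  /** Reconnected within the same session: the ephemeral lock node still exists. */
  void onReconnected() {
    state.compareAndSet(State.SUSPENDED, State.HELD);
  }

  /** Session expired: the ephemeral lock node has been removed by ZooKeeper. */
  void onSessionExpired() {
    state.set(State.LOST);
  }

  /** Call only after the lock has been re-created and verified in a new session. */
  void onLockReacquired() {
    state.compareAndSet(State.LOST, State.HELD);
  }

  /** Work should proceed only while the lock is known to be held. */
  boolean mayProcess() {
    return state.get() == State.HELD;
  }

  State current() {
    return state.get();
  }
}
```

In this sketch a plain reconnect never restores a lost lock; only an explicit reacquisition does. Whether reacquiring is safe for a tserver (for example, if its tablets were reassigned in the meantime) is a separate question that this sketch does not address.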



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
