accumulo-notifications mailing list archives

From "Teng Qiu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-3336) ZK session reconnect still results in loss of ZK lock
Date Fri, 25 Mar 2016 10:56:25 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15211695#comment-15211695 ]

Teng Qiu commented on ACCUMULO-3336:
------------------------------------

One more detail: each tserver process runs in a Docker container on an EC2
instance. We didn't change the VM swap setting, and I'm not sure whether we should change
the cgroup setting on the host or set "--memory-swap=-1" for the Docker container.

But in any case, the tserver should reconnect to ZooKeeper once it comes back up...

> ZK session reconnect still results in loss of ZK lock
> -----------------------------------------------------
>
>                 Key: ACCUMULO-3336
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3336
>             Project: Accumulo
>          Issue Type: Bug
>          Components: zookeeper
>    Affects Versions: 1.5.2, 1.6.1
>            Reporter: Josh Elser
>             Fix For: 1.8.0
>
>
> Saw the following
> {noformat}
> 2014-11-14 08:38:30,612 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper event: None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/bulk_failed_copyq
> 2014-11-14 08:38:30,621 [zookeeper.ZooCache] WARN : Saw (possibly) transient exception communicating with ZooKeeper, will retry
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/config/tserver.compaction.warn.time
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> 	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:260)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:157)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:285)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:232)
> 	at org.apache.accumulo.server.conf.ZooConfiguration.get(ZooConfiguration.java:96)
> 	at org.apache.accumulo.server.conf.ZooConfiguration._get(ZooConfiguration.java:65)
> 	at org.apache.accumulo.server.conf.ZooConfiguration.get(ZooConfiguration.java:90)
> 	at org.apache.accumulo.core.conf.AccumuloConfiguration.getTimeInMillis(AccumuloConfiguration.java:136)
> 	at org.apache.accumulo.tserver.CompactionWatcher.run(CompactionWatcher.java:84)
> 	at org.apache.accumulo.server.util.time.SimpleTimer$LoggingTimerTask.run(SimpleTimer.java:42)
> 	at java.util.TimerThread.mainLoop(Timer.java:555)
> 	at java.util.TimerThread.run(Timer.java:505)
> 2014-11-14 08:38:30,672 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper event: None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/recovery
> 2014-11-14 08:38:30,672 [zookeeper.ZooLock] DEBUG: event null None Disconnected
> 2014-11-14 08:38:31,484 [zookeeper.ZooReader] WARN : Saw (possibly) transient exception communicating with ZooKeeper
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/tservers/ip-172-31-13-177:37709
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> 	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
> 	at org.apache.accumulo.fate.zookeeper.ZooReader.getStatus(ZooReader.java:109)
> 	at org.apache.accumulo.fate.zookeeper.ZooLock.process(ZooLock.java:381)
> 	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
> 	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> 2014-11-14 08:38:31,484 [zookeeper.ZooCache] WARN : Saw (possibly) transient exception communicating with ZooKeeper, will retry
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/tables/!0/namespace
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> 	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:260)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:157)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:285)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:232)
> 	at org.apache.accumulo.core.client.impl.Tables.getNamespaceId(Tables.java:304)
> 	at org.apache.accumulo.server.conf.TableParentConfiguration.getNamespaceId(TableParentConfiguration.java:47)
> 	at org.apache.accumulo.server.conf.NamespaceConfiguration.getPath(NamespaceConfiguration.java:85)
> 	at org.apache.accumulo.server.conf.NamespaceConfiguration.get(NamespaceConfiguration.java:98)
> 	at org.apache.accumulo.server.conf.ZooCachePropertyAccessor.get(ZooCachePropertyAccessor.java:107)
> 	at org.apache.accumulo.server.conf.TableConfiguration.get(TableConfiguration.java:103)
> 	at org.apache.accumulo.core.conf.AccumuloConfiguration.getCount(AccumuloConfiguration.java:193)
> 	at org.apache.accumulo.tserver.TabletServer$MajorCompactor.run(TabletServer.java:2636)
> 	at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
> 	at java.lang.Thread.run(Thread.java:745)
> 2014-11-14 08:38:31,484 [zookeeper.Retry] DEBUG: Sleeping for 250ms before retrying operation
> 2014-11-14 08:38:31,485 [zookeeper.ZooSession] DEBUG: Removing closed ZooKeeper session to localhost:12644
> 2014-11-14 08:38:31,485 [zookeeper.ZooSession] DEBUG: Connecting to localhost:12644 with timeout 30000 with auth
> 2014-11-14 08:38:31,588 [zookeeper.ZooSession] DEBUG: Removing closed ZooKeeper session to localhost:12644
> 2014-11-14 08:38:31,588 [zookeeper.ZooSession] DEBUG: Connecting to localhost:12644 with timeout 30000 with auth
> 2014-11-14 08:38:31,692 [tserver.TabletServer] DEBUG: gc ParNew=0.10(+0.04) secs ConcurrentMarkSweep=0.05(+0.00) secs freemem=118,013,904(+6,412,200) totalmem=129,761,280
> 2014-11-14 08:38:31,692 [tserver.TabletServer] WARN : GC pause checker not called in a timely fashion. Expected every 5.0 seconds but was 43.1 seconds since last check
> 2014-11-14 08:38:31,700 [zookeeper.DistributedWorkQueue] DEBUG: Looking for work in /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/bulk_failed_copyq
> 2014-11-14 08:38:31,701 [zookeeper.DistributedWorkQueue] DEBUG: Looking for work in /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/recovery
> 2014-11-14 08:38:31,715 [tserver.TabletServer] DEBUG: ScanSess tid 172.31.13.177:35935 !0 1 entries in 0.03 secs, nbTimes = [24 24 24.00 1]
> 2014-11-14 08:38:31,737 [trace.ZooTraceClient] DEBUG: Scanning trace hosts in zookeeper: /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/tracers
> 2014-11-14 08:38:31,737 [trace.ZooTraceClient] DEBUG: Trace hosts: []
> 2014-11-14 08:38:31,739 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper event: None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/replication/workqueue
> 2014-11-14 08:38:31,739 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper event: None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/bulk_failed_copyq
> 2014-11-14 08:38:31,739 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper event: None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/recovery
> 2014-11-14 08:38:31,739 [zookeeper.ZooSession] DEBUG: Session expired, state of current session : Expired
> 2014-11-14 08:38:31,739 [zookeeper.ZooLock] DEBUG: event null None Expired
> 2014-11-14 08:38:31,741 [tserver.TabletServer] FATAL: Lost tablet server lock (reason = SESSION_EXPIRED), exiting.
> {noformat}
> The ZooKeeper code appears to have disconnected, closed the disconnected connection, and then opened a new session. However, the ZooLock, IIRC, didn't reconnect and hung the tserver.
> If we want to support this, it might require rehashing some of the ZooLock code (to prevent the tserver from processing while the tserver doesn't have its lock).
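
For readers following the log, the difference between the "Disconnected" and "Expired" events can be sketched as below. This is only an illustration under stated assumptions, not Accumulo's ZooLock code and not the ZooKeeper client API: the class, enums, and method names are all hypothetical, and the sketch assumes the lock is held as an ephemeral node tied to the session, so it cannot survive a session expiry. The second method compares the two numbers visible in the log (the 43.1 second gap reported by the GC pause checker and "timeout 30000", assumed here to be milliseconds).

```java
import java.time.Duration;

/** Hypothetical sketch; names are illustrative, not Accumulo or ZooKeeper APIs. */
final class SessionLossSketch {

    /** The connection states that show up in the log excerpt. */
    enum ConnState { CONNECTED, DISCONNECTED, EXPIRED }

    /** What a lock holder could do in response to each state. */
    enum Action { KEEP_SERVING, PAUSE_UNTIL_RECONNECT, LOCK_LOST }

    /**
     * Maps a connection state to a reaction.
     * Assumption: the lock is an ephemeral node, so it is gone once the
     * session has expired, while a plain disconnect may still recover.
     */
    static Action onStateChange(ConnState state) {
        switch (state) {
            case CONNECTED:
                return Action.KEEP_SERVING;
            case DISCONNECTED:
                // The session may still be alive on the server side.
                return Action.PAUSE_UNTIL_RECONNECT;
            case EXPIRED:
                // A new session can be opened, but it does not hold the old lock.
                return Action.LOCK_LOST;
            default:
                throw new IllegalArgumentException("unknown state: " + state);
        }
    }

    /** True when a process pause is strictly longer than the session timeout. */
    static boolean pauseOutlivesSession(Duration pause, Duration sessionTimeout) {
        return pause.compareTo(sessionTimeout) > 0;
    }
}
```

With the values from the log, pauseOutlivesSession(Duration.ofMillis(43100), Duration.ofMillis(30000)) is true, which would be consistent with the session having expired during the pause rather than the reconnect itself failing. Opening a new session afterwards, as the ZooSession lines show, would not by itself restore a lock held under the expired session.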



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
