accumulo-notifications mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Josh Elser (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (ACCUMULO-3336) ZK session reconnect still results in loss of ZK lock
Date Fri, 14 Nov 2014 18:33:34 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212589#comment-14212589
] 

Josh Elser edited comment on ACCUMULO-3336 at 11/14/14 6:33 PM:
----------------------------------------------------------------

Right before the ZK lock loss, across all of the logs (master + 2tservers) there was a 40s
pause here there was absolutely no logging at all. Seems like there was just some system pause/screw-up.
Obviously, the cause for the session & lock loss are outside the control of Accumulo in
this case.

I remember talking with you about ACCUMULO-3242 and you had mentioned that we bail out typically
in the ZooLock code (assuming that if it's losing its lock, the server should *likely* go
down, or it's at least the safest decision to make). At that time, I had punted on the subject.
Want to rekindle that conversation? It is plausible that such a transient situation might
arise from some network partition in which we should, hypothetically, be able to recover gracefully
from. I'd need to look into what it would actually take to ensure that a tabletserver doesn't
perform any actions while it doesn't have the lock. My hunch is that it would require a significant
refactoring of the tserver and master.

Also, assuming my hunch is correct, we would need to be able to create some sort of environment
which we could actually test and assert that the server processes are doing the correct thing,
not to mention update things like the agitator to inject these types of faults.


was (Author: elserj):
Right before the ZK lock loss, across all of the logs (master + 2tservers) there was a 40s
pause here there was absolutely no logging at all. Seems like there was just some system pause/screw-up.
Obviously, the cause for the session & lock loss are outside the control of Accumulo in
this case.

I remember talking with you about ACCUMULO-3242 and you had mentioned that we bail out typically
in the ZooLock code (assuming that if it's losing its lock, the server should *likely* go
down, or it's at least the safest decision to make). At that time, I had punted on the subject.
Want to rekindle that conversation? It is plausible that such a transient situation might
arise from some network partition in which we should, hypothetically, be able to recover gracefully
from. I'd need to look into what it would actually take to ensure that a tabletserver doesn't
perform any actions while it doesn't have the lock. My hunch is that it would require a significant
refactoring of the tserver and master.

> ZK session reconnect still results in loss of ZK lock
> -----------------------------------------------------
>
>                 Key: ACCUMULO-3336
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3336
>             Project: Accumulo
>          Issue Type: Bug
>          Components: zookeeper
>    Affects Versions: 1.5.2, 1.6.1
>            Reporter: Josh Elser
>             Fix For: 1.7.0
>
>
> Saw the following
> {noformat}
> 2014-11-14 08:38:30,612 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper
event: None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/bulk_failed_copyq
> 2014-11-14 08:38:30,621 [zookeeper.ZooCache] WARN : Saw (possibly) transient exception
communicating with ZooKeeper, will retry
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/config/tserver.compaction.warn.time
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> 	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:260)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:157)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:285)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:232)
> 	at org.apache.accumulo.server.conf.ZooConfiguration.get(ZooConfiguration.java:96)
> 	at org.apache.accumulo.server.conf.ZooConfiguration._get(ZooConfiguration.java:65)
> 	at org.apache.accumulo.server.conf.ZooConfiguration.get(ZooConfiguration.java:90)
> 	at org.apache.accumulo.core.conf.AccumuloConfiguration.getTimeInMillis(AccumuloConfiguration.java:136)
> 	at org.apache.accumulo.tserver.CompactionWatcher.run(CompactionWatcher.java:84)
> 	at org.apache.accumulo.server.util.time.SimpleTimer$LoggingTimerTask.run(SimpleTimer.java:42)
> 	at java.util.TimerThread.mainLoop(Timer.java:555)
> 	at java.util.TimerThread.run(Timer.java:505)
> 2014-11-14 08:38:30,672 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper
event: None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/recovery
> 2014-11-14 08:38:30,672 [zookeeper.ZooLock] DEBUG: event null None Disconnected
> 2014-11-14 08:38:31,484 [zookeeper.ZooReader] WARN : Saw (possibly) transient exception
communicating with ZooKeeper
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session
expired for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/tservers/ip-172-31-13-177:37709
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> 	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
> 	at org.apache.accumulo.fate.zookeeper.ZooReader.getStatus(ZooReader.java:109)
> 	at org.apache.accumulo.fate.zookeeper.ZooLock.process(ZooLock.java:381)
> 	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
> 	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> 2014-11-14 08:38:31,484 [zookeeper.ZooCache] WARN : Saw (possibly) transient exception
communicating with ZooKeeper, will retry
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session
expired for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/tables/!0/namespace
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> 	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:260)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:157)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:285)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:232)
> 	at org.apache.accumulo.core.client.impl.Tables.getNamespaceId(Tables.java:304)
> 	at org.apache.accumulo.server.conf.TableParentConfiguration.getNamespaceId(TableParentConfiguration.java:47)
> 	at org.apache.accumulo.server.conf.NamespaceConfiguration.getPath(NamespaceConfiguration.java:85)
> 	at org.apache.accumulo.server.conf.NamespaceConfiguration.get(NamespaceConfiguration.java:98)
> 	at org.apache.accumulo.server.conf.ZooCachePropertyAccessor.get(ZooCachePropertyAccessor.java:107)
> 	at org.apache.accumulo.server.conf.TableConfiguration.get(TableConfiguration.java:103)
> 	at org.apache.accumulo.core.conf.AccumuloConfiguration.getCount(AccumuloConfiguration.java:193)
> 	at org.apache.accumulo.tserver.TabletServer$MajorCompactor.run(TabletServer.java:2636)
> 	at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
> 	at java.lang.Thread.run(Thread.java:745)
> 2014-11-14 08:38:31,484 [zookeeper.Retry] DEBUG: Sleeping for 250ms before retrying operation
> 2014-11-14 08:38:31,485 [zookeeper.ZooSession] DEBUG: Removing closed ZooKeeper session
to localhost:12644
> 2014-11-14 08:38:31,485 [zookeeper.ZooSession] DEBUG: Connecting to localhost:12644 with
timeout 30000 with auth
> 2014-11-14 08:38:31,588 [zookeeper.ZooSession] DEBUG: Removing closed ZooKeeper session
to localhost:12644
> 2014-11-14 08:38:31,588 [zookeeper.ZooSession] DEBUG: Connecting to localhost:12644 with
timeout 30000 with auth
> 2014-11-14 08:38:31,692 [tserver.TabletServer] DEBUG: gc ParNew=0.10(+0.04) secs ConcurrentMarkSweep=0.05(+0.00)
secs freemem=118,013,904(+6,412,200) totalmem=129,761,280
> 2014-11-14 08:38:31,692 [tserver.TabletServer] WARN : GC pause checker not called in
a timely fashion. Expected every 5.0 seconds but was 43.1 seconds since last check
> 2014-11-14 08:38:31,700 [zookeeper.DistributedWorkQueue] DEBUG: Looking for work in /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/bulk_failed_copyq
> 2014-11-14 08:38:31,701 [zookeeper.DistributedWorkQueue] DEBUG: Looking for work in /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/recovery
> 2014-11-14 08:38:31,715 [tserver.TabletServer] DEBUG: ScanSess tid 172.31.13.177:35935
!0 1 entries in 0.03 secs, nbTimes = [24 24 24.00 1] 
> 2014-11-14 08:38:31,737 [trace.ZooTraceClient] DEBUG: Scanning trace hosts in zookeeper:
/accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/tracers
> 2014-11-14 08:38:31,737 [trace.ZooTraceClient] DEBUG: Trace hosts: []
> 2014-11-14 08:38:31,739 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper
event: None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/replication/workqueue
> 2014-11-14 08:38:31,739 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper
event: None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/bulk_failed_copyq
> 2014-11-14 08:38:31,739 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper
event: None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/recovery
> 2014-11-14 08:38:31,739 [zookeeper.ZooSession] DEBUG: Session expired, state of current
session : Expired
> 2014-11-14 08:38:31,739 [zookeeper.ZooLock] DEBUG: event null None Expired
> 2014-11-14 08:38:31,741 [tserver.TabletServer] FATAL: Lost tablet server lock (reason
= SESSION_EXPIRED), exiting.
> {noformat}
> ZooKeeper code appears to had disconnected, closed the disconnected connection and then
opened a new session. However, the ZooLock, IIRC, didn't reconnect and hung the tserver.
> If we want to support this, it might require rehashing some of the ZooLock code (to prevent
the tserver from processing while the tserver doesn't have its lock).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message