Date: Fri, 14 Nov 2014 18:12:33 +0000 (UTC)
From: "Josh Elser (JIRA)"
To: notifications@accumulo.apache.org
Subject: [jira] [Commented] (ACCUMULO-3336) ZK session reconnect still results in loss of ZK lock

    [ https://issues.apache.org/jira/browse/ACCUMULO-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212567#comment-14212567 ]

Josh Elser commented on ACCUMULO-3336:
--------------------------------------

I do have logs. I need to give them more than the cursory look I gave them before filing this (will attach them as well). I saw the logs that I pasted in the description and thought they were suspect. Your assessment is also extremely plausible, because I've regularly been seeing "normal execution" problems in the environment where I saw this.

> ZK session reconnect still results in loss of ZK lock
> -----------------------------------------------------
>
> Key: ACCUMULO-3336
> URL: https://issues.apache.org/jira/browse/ACCUMULO-3336
> Project: Accumulo
> Issue Type: Bug
> Components: zookeeper
> Affects Versions: 1.5.2, 1.6.1
> Reporter: Josh Elser
> Fix For: 1.7.0
>
>
> Saw the following
> {noformat}
> 2014-11-14 08:38:30,612 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper event: None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/bulk_failed_copyq
> 2014-11-14 08:38:30,621 [zookeeper.ZooCache] WARN : Saw (possibly) transient exception communicating with ZooKeeper, will retry
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/config/tserver.compaction.warn.time
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
>     at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:260)
>     at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:157)
>     at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:285)
>     at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:232)
>     at org.apache.accumulo.server.conf.ZooConfiguration.get(ZooConfiguration.java:96)
>     at org.apache.accumulo.server.conf.ZooConfiguration._get(ZooConfiguration.java:65)
>     at org.apache.accumulo.server.conf.ZooConfiguration.get(ZooConfiguration.java:90)
>     at org.apache.accumulo.core.conf.AccumuloConfiguration.getTimeInMillis(AccumuloConfiguration.java:136)
>     at org.apache.accumulo.tserver.CompactionWatcher.run(CompactionWatcher.java:84)
>     at org.apache.accumulo.server.util.time.SimpleTimer$LoggingTimerTask.run(SimpleTimer.java:42)
>     at java.util.TimerThread.mainLoop(Timer.java:555)
>     at java.util.TimerThread.run(Timer.java:505)
> 2014-11-14 08:38:30,672 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper event: None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/recovery
> 2014-11-14 08:38:30,672 [zookeeper.ZooLock] DEBUG: event null None Disconnected
> 2014-11-14 08:38:31,484 [zookeeper.ZooReader] WARN : Saw (possibly) transient exception communicating with ZooKeeper
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/tservers/ip-172-31-13-177:37709
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
>     at org.apache.accumulo.fate.zookeeper.ZooReader.getStatus(ZooReader.java:109)
>     at org.apache.accumulo.fate.zookeeper.ZooLock.process(ZooLock.java:381)
>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> 2014-11-14 08:38:31,484 [zookeeper.ZooCache] WARN : Saw (possibly) transient exception communicating with ZooKeeper, will retry
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/tables/!0/namespace
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
>     at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:260)
>     at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:157)
>     at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:285)
>     at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:232)
>     at org.apache.accumulo.core.client.impl.Tables.getNamespaceId(Tables.java:304)
>     at org.apache.accumulo.server.conf.TableParentConfiguration.getNamespaceId(TableParentConfiguration.java:47)
>     at org.apache.accumulo.server.conf.NamespaceConfiguration.getPath(NamespaceConfiguration.java:85)
>     at org.apache.accumulo.server.conf.NamespaceConfiguration.get(NamespaceConfiguration.java:98)
>     at org.apache.accumulo.server.conf.ZooCachePropertyAccessor.get(ZooCachePropertyAccessor.java:107)
>     at org.apache.accumulo.server.conf.TableConfiguration.get(TableConfiguration.java:103)
>     at org.apache.accumulo.core.conf.AccumuloConfiguration.getCount(AccumuloConfiguration.java:193)
>     at org.apache.accumulo.tserver.TabletServer$MajorCompactor.run(TabletServer.java:2636)
>     at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
>     at java.lang.Thread.run(Thread.java:745)
> 2014-11-14 08:38:31,484 [zookeeper.Retry] DEBUG: Sleeping for 250ms before retrying operation
> 2014-11-14 08:38:31,485 [zookeeper.ZooSession] DEBUG: Removing closed ZooKeeper session to localhost:12644
> 2014-11-14 08:38:31,485 [zookeeper.ZooSession] DEBUG: Connecting to localhost:12644 with timeout 30000 with auth
> 2014-11-14 08:38:31,588 [zookeeper.ZooSession] DEBUG: Removing closed ZooKeeper session to localhost:12644
> 2014-11-14 08:38:31,588 [zookeeper.ZooSession] DEBUG: Connecting to localhost:12644 with timeout 30000 with auth
> 2014-11-14 08:38:31,692 [tserver.TabletServer] DEBUG: gc ParNew=0.10(+0.04) secs ConcurrentMarkSweep=0.05(+0.00) secs freemem=118,013,904(+6,412,200) totalmem=129,761,280
> 2014-11-14 08:38:31,692 [tserver.TabletServer] WARN : GC pause checker not called in a timely fashion. Expected every 5.0 seconds but was 43.1 seconds since last check
> 2014-11-14 08:38:31,700 [zookeeper.DistributedWorkQueue] DEBUG: Looking for work in /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/bulk_failed_copyq
> 2014-11-14 08:38:31,701 [zookeeper.DistributedWorkQueue] DEBUG: Looking for work in /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/recovery
> 2014-11-14 08:38:31,715 [tserver.TabletServer] DEBUG: ScanSess tid 172.31.13.177:35935 !0 1 entries in 0.03 secs, nbTimes = [24 24 24.00 1]
> 2014-11-14 08:38:31,737 [trace.ZooTraceClient] DEBUG: Scanning trace hosts in zookeeper: /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/tracers
> 2014-11-14 08:38:31,737 [trace.ZooTraceClient] DEBUG: Trace hosts: []
> 2014-11-14 08:38:31,739 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper event: None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/replication/workqueue
> 2014-11-14 08:38:31,739 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper event: None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/bulk_failed_copyq
> 2014-11-14 08:38:31,739 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper event: None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/recovery
> 2014-11-14 08:38:31,739 [zookeeper.ZooSession] DEBUG: Session expired, state of current session : Expired
> 2014-11-14 08:38:31,739 [zookeeper.ZooLock] DEBUG: event null None Expired
> 2014-11-14 08:38:31,741 [tserver.TabletServer] FATAL: Lost tablet server lock (reason = SESSION_EXPIRED), exiting.
> {noformat}
> ZooKeeper code appears to have disconnected, closed the disconnected connection and then opened a new session. However, the ZooLock, IIRC, didn't reconnect and hung the tserver.
> If we want to support this, it might require rehashing some of the ZooLock code (to prevent the tserver from processing while the tserver doesn't have its lock).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
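
The quoted description turns on the difference between a Disconnected event, which the same session can recover from, and an Expired event, after which ZooKeeper removes the session's ephemeral nodes and the lock cannot be recovered by reconnecting. The sketch below is one rough way to picture the "prevent the tserver from processing while it doesn't have its lock" idea. It is not Accumulo's ZooLock and does not use the ZooKeeper client API: SessionGate, Event and State are hypothetical names, and it assumes the lock is held whenever the original session is connected.

```java
import java.util.concurrent.atomic.AtomicReference;

/**
 * Hypothetical sketch (not Accumulo's ZooLock): gate work on the state of
 * the ZooKeeper session that owns the lock.
 */
final class SessionGate {
    /** Simplified stand-ins for the connection states seen in the log. */
    enum Event { CONNECTED, DISCONNECTED, EXPIRED }

    /** What the server is allowed to do right now. */
    enum State { HOLDING_LOCK, PAUSED, LOCK_LOST }

    // Start paused: nothing may run until the session reports it is connected.
    private final AtomicReference<State> state = new AtomicReference<>(State.PAUSED);

    /** Apply a connection event and return the resulting state. */
    State onEvent(Event event) {
        return state.updateAndGet(current -> {
            if (current == State.LOCK_LOST) {
                // An expired session never comes back; a new session would
                // have to acquire a new lock rather than resume this one.
                return current;
            }
            switch (event) {
                case CONNECTED:
                    // Assumption: the lock node belongs to this session.
                    return State.HOLDING_LOCK;
                case DISCONNECTED:
                    // Lock may still be valid, but we cannot confirm it.
                    return State.PAUSED;
                case EXPIRED:
                    return State.LOCK_LOST;
                default:
                    throw new IllegalArgumentException("unknown event: " + event);
            }
        });
    }

    /** True only while the lock is known to be held. */
    boolean mayProcess() {
        return state.get() == State.HOLDING_LOCK;
    }

    /** True once the owning session has expired. */
    boolean lockLost() {
        return state.get() == State.LOCK_LOST;
    }
}
```

Under these assumptions, worker threads would check mayProcess() before touching tablets, and a later CONNECTED event from a brand-new session would not flip the gate back on, which corresponds to the open question in the description about what supporting a reconnect would require.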