Return-Path: X-Original-To: apmail-accumulo-notifications-archive@minotaur.apache.org Delivered-To: apmail-accumulo-notifications-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 032C81897F for ; Thu, 6 Aug 2015 17:15:08 +0000 (UTC) Received: (qmail 50585 invoked by uid 500); 6 Aug 2015 17:15:07 -0000 Delivered-To: apmail-accumulo-notifications-archive@accumulo.apache.org Received: (qmail 50498 invoked by uid 500); 6 Aug 2015 17:15:07 -0000 Mailing-List: contact notifications-help@accumulo.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: jira@apache.org Delivered-To: mailing list notifications@accumulo.apache.org Received: (qmail 50209 invoked by uid 99); 6 Aug 2015 17:15:07 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 06 Aug 2015 17:15:07 +0000 Date: Thu, 6 Aug 2015 17:15:07 +0000 (UTC) From: "Josh Elser (JIRA)" To: notifications@accumulo.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (ACCUMULO-3954) Demote warning/error about RPCs received in tabletserver before lock is acquired MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/ACCUMULO-3954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660377#comment-14660377 ] Josh Elser commented on ACCUMULO-3954: -------------------------------------- Actually, I'm wondering if we can actually fix the race condition in the TabletServer itself in the {{announceExistence}} method. It appears that we: 1. Advertise our existence in zookeeper 2. Acquire TabletServer zookeeper lock It seems like this is backwards and that we should acquire the lock first, and then tell the world that we exist. The monitor saw the tabletserver existed (via step 1) and happened to call getActiveScans before step 2 happened. If we reverse the order in the tabletserver, it should be impossible for the monitor to cause this. Also, it's worth mentioning that the log message is wrong too (the rpc came from the monitor, not the master). We should just make the warning message more generic, e.g. "got $method message before lock was acquired". > Demote warning/error about RPCs received in tabletserver before lock is acquired > -------------------------------------------------------------------------------- > > Key: ACCUMULO-3954 > URL: https://issues.apache.org/jira/browse/ACCUMULO-3954 > Project: Accumulo > Issue Type: Bug > Components: tserver > Affects Versions: 1.6.3, 1.7.0 > Reporter: Josh Elser > Assignee: Josh Elser > Priority: Trivial > Fix For: 1.8.0 > > > Noticed this error today on the monitor after restarting Accumulo: > {noformat} > 2015-08-06 16:57:10,788 [tserver.TabletServer] WARN : tserver:hostname Got getScans message from master before lock acquired, ignoring... > 2015-08-06 16:57:10,791 [tserver.TabletServer$ThriftClientHandler] ERROR: tserver:jelser-phoenix-1.openstacklocal Lock not acquired > java.lang.RuntimeException: Lock not acquired > at org.apache.accumulo.tserver.TabletServer$ThriftClientHandler.checkPermission(TabletServer.java:1400) > at org.apache.accumulo.tserver.TabletServer$ThriftClientHandler.getActiveScans(TabletServer.java:1622) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at org.apache.accumulo.core.trace.wrappers.RpcServerInvocationHandler.invoke(RpcServerInvocationHandler.java:39) > at org.apache.accumulo.server.rpc.RpcWrapper$1.invoke(RpcWrapper.java:47) > at com.sun.proxy.$Proxy21.getActiveScans(Unknown Source) > at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$getActiveScans.getResult(TabletClientService.java:2864) > at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$getActiveScans.getResult(TabletClientService.java:2848) > at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) > at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) > at org.apache.accumulo.server.rpc.TimedProcessor.process(TimedProcessor.java:63) > at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:516) > at org.apache.accumulo.server.rpc.CustomNonBlockingServer$1.run(CustomNonBlockingServer.java:78) > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35) > at java.lang.Thread.run(Thread.java:745) > {noformat} > I don't think this should be bubbling up to the monitor as an error. I believe it is an entirely normal race condition that can happen. If the tabletserver is not ready to accept an RPC, it can log a message at debug. The error condition would be a tabletserver never acquiring its lock (and thus should be handled elsewhere). -- This message was sent by Atlassian JIRA (v6.3.4#6332)