Return-Path:
X-Original-To: apmail-accumulo-notifications-archive@minotaur.apache.org
Delivered-To: apmail-accumulo-notifications-archive@minotaur.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id DBD3B17775 for ; Tue, 10 Mar 2015 05:18:38 +0000 (UTC)
Received: (qmail 57787 invoked by uid 500); 10 Mar 2015 05:18:38 -0000
Delivered-To: apmail-accumulo-notifications-archive@accumulo.apache.org
Received: (qmail 57747 invoked by uid 500); 10 Mar 2015 05:18:38 -0000
Mailing-List: contact notifications-help@accumulo.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: jira@apache.org
Delivered-To: mailing list notifications@accumulo.apache.org
Received: (qmail 57451 invoked by uid 99); 10 Mar 2015 05:18:38 -0000
Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 10 Mar 2015 05:18:38 +0000
Date: Tue, 10 Mar 2015 05:18:38 +0000 (UTC)
From: "Josh Elser (JIRA)"
To: notifications@accumulo.apache.org
Message-ID:
In-Reply-To:
References:
Subject: [jira] [Commented] (ACCUMULO-3597) Metadata table load prevented by flush
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

    [ https://issues.apache.org/jira/browse/ACCUMULO-3597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14354351#comment-14354351 ]

Josh Elser commented on ACCUMULO-3597:
--------------------------------------

Might be cleaner to save the result of the conditional check {{extent.isRootTablet() || extent.isMeta()}}, construct the {{TabletClientService.Client}} (depending on that result), and then check the saved result again in the {{finally}} to close or return the client. That removes the duplicate call to {{loadTablet}}.
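As a rough illustration of that shape (not taken from the patch), here is a minimal sketch. Everything in it is a hypothetical stand-in: {{Extent}}, {{Client}}, and the open/close/borrow/return helpers are placeholders rather than the real Accumulo classes, and it assumes that root/meta loads get a dedicated client that is closed afterwards while other loads borrow a client that is returned.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins only; the real KeyExtent and
// TabletClientService.Client look different.
class LoadTabletSketch {
  // Records what happened so the control flow is easy to inspect.
  static final List<String> events = new ArrayList<>();

  static class Extent {
    private final boolean root;
    private final boolean meta;
    Extent(boolean root, boolean meta) { this.root = root; this.meta = meta; }
    boolean isRootTablet() { return root; }
    boolean isMeta() { return meta; }
  }

  static class Client {
    private final String kind;
    Client(String kind) { this.kind = kind; }
    void loadTablet(Extent extent) { events.add("loadTablet via " + kind); }
  }

  // Placeholder helpers (assumptions): dedicated clients are opened and
  // closed, pooled clients are borrowed and returned.
  static Client openDedicatedClient() { events.add("open dedicated"); return new Client("dedicated"); }
  static void closeDedicatedClient(Client c) { events.add("close dedicated"); }
  static Client borrowPooledClient() { events.add("borrow pooled"); return new Client("pooled"); }
  static void returnPooledClient(Client c) { events.add("return pooled"); }

  static void load(Extent extent) {
    // Evaluate the condition once and reuse it for setup and cleanup.
    final boolean dedicated = extent.isRootTablet() || extent.isMeta();
    Client client = dedicated ? openDedicatedClient() : borrowPooledClient();
    try {
      client.loadTablet(extent); // the only loadTablet call site
    } finally {
      if (dedicated) {
        closeDedicatedClient(client);
      } else {
        returnPooledClient(client);
      }
    }
  }

  public static void main(String[] args) {
    load(new Extent(false, true));  // metadata tablet
    load(new Extent(false, false)); // user tablet
    events.forEach(System.out::println);
  }
}
```

The saved boolean keeps the setup and cleanup branches in agreement even if the extent check were to become more involved later.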

Another alternative would be to compute the arguments to {{loadTablet}} once and pass local variables as the method arguments, but I think the above is less error prone (although slightly more code to look at).

{code}
TTransport transport = ThriftUtil.createTransport(address, conf);
{code}

Do you think there's any value in specifically caching these transports for root/meta loads? I'm not sure what the overhead of constructing a new transport is and whether it's worth the additional complexity.

Overall, I think it's fine with what you have now -- just wanted to give my thoughts since I read the patch. Take the suggestions as you see fit.

> Metadata table load prevented by flush
> --------------------------------------
>
>                 Key: ACCUMULO-3597
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3597
>             Project: Accumulo
>          Issue Type: Bug
>    Affects Versions: 1.6.0, 1.6.1, 1.6.2
>            Reporter: Keith Turner
>             Fix For: 1.7.0, 1.6.3
>
>         Attachments: ACCUMULO-3597-1.patch
>
>
> Was running random walk test against 1.6.2 RC5 on a 20 node EC2 cluster. Everything hung because a metadata table was not loading. I think the problem was a flush message.
> On this cluster the master was 10.1.2.10 and the tserver that was supposed to load a metadata tablet was 10.1.2.13.
> Below are the root tablet entries for the problem metadata tablet, showing it has a future location of 10.1.2.13.
> {noformat}
> !0< file:hdfs://ip-10-1-2-11:9000/accumulo/tables/!0/default_tablet/A0000xs5.rf [] 59542,7512
> !0< file:hdfs://ip-10-1-2-11:9000/accumulo/tables/!0/default_tablet/F0000xs8.rf [] 8596,927
> !0< file:hdfs://ip-10-1-2-11:9000/accumulo/tables/!0/default_tablet/F0000xs9.rf [] 1735,70
> !0< future:24b7ebf8cba00c3 [] ip-10-1-2-13:9997
> !0< last:24b7ebf8cba00f4 [] ip-10-1-2-22:9997
> !0< srv:compact [] 39
> !0< srv:dir [] hdfs://ip-10-1-2-11:9000/accumulo/tables/!0/default_tablet
> !0< srv:flush [] 39
> !0< srv:lock [] tservers/ip-10-1-2-22:9997/zlock-0000000001$24b7ebf8cba00f4
> !0< srv:time [] L193895
> !0< ~tab:~pr [] \x0179dd555cc928f80d
> {noformat}
> Grepping the tserver logs, shown below, turns up nothing about loading the tablet.
> {noformat}
> $ grep 79dd555cc928f80d tserver_ip-10-1-2-13.ec2.internal.debug.log
> 2015-02-12 20:24:49,526 [impl.ThriftScanner] DEBUG: Scan failed, not serving tablet (!0<;79dd555cc928f80d,ip-10-1-2-22:9997,24b7ebf8cba00f4)
> {noformat}
> Below, {{netstat -nape}} run on the tserver shows a lot of backed-up data from master to tserver. I suspect the tablet load messages are in this backed-up data.
> {noformat}
> tcp 471408 0 10.1.2.13:9997 10.1.2.10:51271 ESTABLISHED 500 659703 30785/java
> {noformat}
> Below is a flush thread on the tserver stuck waiting to update the problem metadata tablet.
> {noformat}
> "ClientPool 420" daemon prio=10 tid=0x0000000038b72000 nid=0x5f4e waiting on condition [0x00007fea175c8000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.accumulo.core.util.UtilWaitThread.sleep(UtilWaitThread.java:26)
>         at org.apache.accumulo.core.client.impl.TabletLocatorImpl.locateTablet(TabletLocatorImpl.java:442)
>         at org.apache.accumulo.core.client.impl.Writer.update(Writer.java:85)
>         at org.apache.accumulo.server.util.MetadataTableUtil.update(MetadataTableUtil.java:143)
>         at org.apache.accumulo.server.util.MetadataTableUtil.update(MetadataTableUtil.java:135)
>         at org.apache.accumulo.server.util.MetadataTableUtil.updateTabletFlushID(MetadataTableUtil.java:164)
>         at org.apache.accumulo.tserver.Tablet.flush(Tablet.java:2227)
>         at org.apache.accumulo.tserver.TabletServer$ThriftClientHandler.flush(TabletServer.java:2380)
>         at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at org.apache.accumulo.trace.instrument.thrift.RpcServerInvocationHandler.invoke(RpcServerInvocationHandler.java:46)
>         at org.apache.accumulo.server.util.RpcWrapper$1.invoke(RpcWrapper.java:47)
>         at com.sun.proxy.$Proxy22.flush(Unknown Source)
>         at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$flush.getResult(TabletClientService.java:2595)
>         at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$flush.getResult(TabletClientService.java:2581)
>         at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
>         at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
>         at org.apache.accumulo.server.util.TServerUtils$TimedProcessor.process(TServerUtils.java:168)
>         at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:516)
>         at org.apache.accumulo.server.util.CustomNonBlockingServer$1.run(CustomNonBlockingServer.java:77)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
>         at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
>         at java.lang.Thread.run(Thread.java:744)
> {noformat}
> Below are the loadTablet and flush messages from thrift. I think the master sent a oneway flush call, then a oneway loadTablet call, over the same connection. I think the flush blocked waiting for the tablet to load, and because it was ahead of the load message on that connection, the flush was preventing the tablet from loading.
> {noformat}
> oneway void loadTablet(5:trace.TInfo tinfo, 1:security.TCredentials credentials, 4:string lock, 2:data.TKeyExtent extent),
> oneway void flush(4:trace.TInfo tinfo, 1:security.TCredentials credentials, 3:string lock, 2:string tableId, 5:binary startRow, 6:binary endRow),
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
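
The ordering problem suspected in the quoted description can be reproduced in miniature. The sketch below is only an analogy under stated assumptions: an executor stands in for the connection, and the names ({{OnewayOrderingSketch}}, {{flushCompletes}}) are made up for illustration. It does not use Thrift or Accumulo code. With one worker, the "flush" queued first waits for a load that is stuck behind it; with a second, independent worker, the load gets through and the flush can finish.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Analogy only: messages on one connection are modeled as tasks that
// run in submission order on a fixed number of worker threads.
class OnewayOrderingSketch {

  // Returns true if the "flush" finished within the timeout.
  static boolean flushCompletes(int workers, long timeoutMillis) {
    ExecutorService connection = Executors.newFixedThreadPool(workers);
    CountDownLatch tabletLoaded = new CountDownLatch(1);
    CountDownLatch flushDone = new CountDownLatch(1);
    try {
      // First message: flush, which needs the tablet to be loaded.
      connection.execute(() -> {
        try {
          tabletLoaded.await();
          flushDone.countDown();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
      // Second message: the load that the flush is waiting for.
      connection.execute(tabletLoaded::countDown);
      return flushDone.await(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } finally {
      connection.shutdownNow(); // interrupts a flush that is still stuck
    }
  }

  public static void main(String[] args) {
    System.out.println("one worker:  " + flushCompletes(1, 300));
    System.out.println("two workers: " + flushCompletes(2, 5000));
  }
}
```

In this model, giving the load its own path (the second worker) is the analogue of sending root/meta tablet loads over a separate transport.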