accumulo-notifications mailing list archives

From "Keith Turner (JIRA)" <>
Subject [jira] [Created] (ACCUMULO-3597) Metadata table load prevented by flush
Date Fri, 13 Feb 2015 21:45:12 GMT
Keith Turner created ACCUMULO-3597:

             Summary: Metadata table load prevented by flush
                 Key: ACCUMULO-3597
             Project: Accumulo
          Issue Type: Bug
            Reporter: Keith Turner
             Fix For: 1.7.0, 1.6.3

I was running the random walk test against 1.6.2 RC5 on a 20-node EC2 cluster.  Everything hung
because a metadata tablet was not loading.  I think the problem was a flush message.

On this cluster the master was and the tserver that was supposed to load a metadata
tablet was

Below are the root tablet entries for the problem metadata tablet, showing it has a future location.

!0< file:hdfs://ip-10-1-2-11:9000/accumulo/tables/!0/default_tablet/A0000xs5.rf []    59542,7512
!0< file:hdfs://ip-10-1-2-11:9000/accumulo/tables/!0/default_tablet/F0000xs8.rf []    8596,927
!0< file:hdfs://ip-10-1-2-11:9000/accumulo/tables/!0/default_tablet/F0000xs9.rf []    1735,70
!0< future:24b7ebf8cba00c3 []    ip-10-1-2-13:9997
!0< last:24b7ebf8cba00f4 []    ip-10-1-2-22:9997
!0< srv:compact []    39
!0< srv:dir []    hdfs://ip-10-1-2-11:9000/accumulo/tables/!0/default_tablet
!0< srv:flush []    39
!0< srv:lock []    tservers/ip-10-1-2-22:9997/zlock-0000000001$24b7ebf8cba00f4
!0< srv:time []    L193895
!0< ~tab:~pr []    \x0179dd555cc928f80d

Grepping the tserver logs, as shown below, turns up nothing about loading the tablet.

$ grep 79dd555cc928f80d tserver_ip-10-1-2-13.ec2.internal.debug.log 
2015-02-12 20:24:49,526 [impl.ThriftScanner] DEBUG: Scan failed, not serving tablet (!0<;79dd555cc928f80d,ip-10-1-2-22:9997,24b7ebf8cba00f4)

Below, {{netstat -nape}} run on the tserver shows a lot of backed-up data from the master to the tserver.
I suspect the tablet load messages are in this backed-up data.

tcp   471408      0                 ESTABLISHED 500        659703     30785/java

Below is a flush thread on the tserver stuck waiting to update the problem metadata tablet.

"ClientPool 420" daemon prio=10 tid=0x0000000038b72000 nid=0x5f4e waiting on condition [0x00007fea175c8000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.accumulo.core.util.UtilWaitThread.sleep(
        at org.apache.accumulo.core.client.impl.TabletLocatorImpl.locateTablet(
        at org.apache.accumulo.core.client.impl.Writer.update(
        at org.apache.accumulo.server.util.MetadataTableUtil.update(
        at org.apache.accumulo.server.util.MetadataTableUtil.update(
        at org.apache.accumulo.server.util.MetadataTableUtil.updateTabletFlushID(
        at org.apache.accumulo.tserver.Tablet.flush(
        at org.apache.accumulo.tserver.TabletServer$ThriftClientHandler.flush(
        at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(
        at java.lang.reflect.Method.invoke(
        at org.apache.accumulo.trace.instrument.thrift.RpcServerInvocationHandler.invoke(
        at org.apache.accumulo.server.util.RpcWrapper$1.invoke(
        at com.sun.proxy.$Proxy22.flush(Unknown Source)
        at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$flush.getResult(
        at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$flush.getResult(
        at org.apache.thrift.ProcessFunction.process(
        at org.apache.thrift.TBaseProcessor.process(
        at org.apache.accumulo.server.util.TServerUtils$TimedProcessor.process(
        at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(
        at org.apache.accumulo.server.util.CustomNonBlockingServer$
        at java.util.concurrent.ThreadPoolExecutor.runWorker(
        at java.util.concurrent.ThreadPoolExecutor$

Below are the loadTablet and flush messages from thrift.  I think the master sent a oneway
flush call, then a oneway loadTablet call over the same connection.  I think the flush blocked
waiting for the tablet to load, and the blocked flush in turn prevented the tablet from loading.

  oneway void loadTablet(5:trace.TInfo tinfo, 1:security.TCredentials credentials, 4:string lock, 2:data.TKeyExtent extent),
  oneway void flush(4:trace.TInfo tinfo, 1:security.TCredentials credentials, 3:string lock, 2:string tableId, 5:binary startRow, 6:binary endRow),
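
The suspected ordering problem can be illustrated with a minimal, self-contained sketch. This is not Accumulo or Thrift code; it only models the hypothesis above under one assumption: messages arriving on a single connection are handled strictly in order, which is modeled here as a thread pool with one handler thread. The class name OnewayOrderingSketch, the method flushCompletes, and the latch tabletLoaded are all hypothetical names made up for this illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical model of the suspected hang; not Accumulo code.
class OnewayOrderingSketch {

  // Returns true if the simulated flush finishes within timeoutMillis.
  // handlerThreads == 1 models "one connection, messages handled in order" (an assumption).
  static boolean flushCompletes(int handlerThreads, long timeoutMillis) {
    CountDownLatch tabletLoaded = new CountDownLatch(1);
    ExecutorService connection = Executors.newFixedThreadPool(handlerThreads);
    try {
      // Message 1: the flush needs the metadata tablet, so it waits until it is loaded.
      Callable<Void> flushMsg = () -> {
        tabletLoaded.await();
        return null;
      };
      // Message 2: the load request that would make the tablet available.
      Runnable loadMsg = () -> tabletLoaded.countDown();

      Future<Void> flush = connection.submit(flushMsg);
      connection.submit(loadMsg); // queued behind the flush when there is one handler

      flush.get(timeoutMillis, TimeUnit.MILLISECONDS);
      return true;
    } catch (TimeoutException e) {
      return false; // flush never saw the tablet load
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      connection.shutdownNow(); // interrupt the stuck flush so the JVM can exit
    }
  }

  public static void main(String[] args) {
    System.out.println("one handler:  flush completes = " + flushCompletes(1, 500));
    System.out.println("two handlers: flush completes = " + flushCompletes(2, 2000));
  }
}
```

With one handler the flush occupies the only thread and the load message stays queued behind it, so the flush times out. With two handlers the load runs independently and the flush completes. Whether the real tserver behaves like the one-handler case is exactly the open question in this report.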

This message was sent by Atlassian JIRA
