accumulo-notifications mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-3914) Restarting HDFS caused scan to fail
Date Sun, 28 Jun 2015 16:55:05 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-3914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604777#comment-14604777 ]

Steve Loughran commented on ACCUMULO-3914:
------------------------------------------

If HDFS threw this in the {{RetryInvocationHandler}}, then it made that decision based on the
outcome of the communication attempts and the retry policy. Clearly it decided that the exception
wasn't going to be retried; yet, as it is one of the exceptions which the {{FailoverOnNetworkExceptionRetry}}
policy will retry on, either a different policy was in use or the client gave up trying.
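
For reference, a hedged sketch (not code from the patch) of how Hadoop's retry machinery makes
that decision: {{RetryPolicy.shouldRetry()}} returns the action, and {{FailoverOnNetworkExceptionRetry}}
treats {{ConnectException}} as retryable. The fallback policy and counts below are illustrative
values, not HDFS defaults.

{noformat}
// Hedged sketch of how org.apache.hadoop.io.retry policies decide; the
// fallback policy and counts are illustrative, not HDFS defaults.
import java.net.ConnectException;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;

public class RetryDecisionSketch {
  public static void main(String[] args) throws Exception {
    RetryPolicy policy = RetryPolicies.failoverOnNetworkException(
        RetryPolicies.retryUpToMaximumCountWithFixedSleep(
            6, 10, TimeUnit.SECONDS),   // fallback for other exceptions
        15);                            // max failover attempts

    // getBlockLocations is idempotent, so the last argument is true.
    RetryPolicy.RetryAction action = policy.shouldRetry(
        new ConnectException("Connection refused"),
        0 /* retries so far */, 0 /* failovers so far */, true);

    // For a ConnectException this policy does not return FAIL, which is why
    // seeing the exception surface at the caller implies a different policy
    // (or an exhausted retry budget) was in play.
    System.out.println(action.action);  // e.g. FAILOVER_AND_RETRY
  }
}
{noformat}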

Rather than wrap another catch/repeat layer on top, one that (in the patch) doesn't try to
discriminate between the exceptions that are worth retrying and those that aren't (security
exceptions, interrupted IO, ...), I'd recommend making sure that {{"dfs.client.retry.policy.enabled"}}
= true and looking at some of the other tunable parameters. The HDFS client should be able to
adopt a retry policy that suits; if it doesn't, that's somewhere there may be scope for improvement.
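
Concretely, a minimal client-side {{hdfs-site.xml}} sketch: {{dfs.client.retry.policy.enabled}}
is the switch named above, and {{dfs.client.retry.policy.spec}} (alternating sleep-ms,retry-count
pairs) is one of the related tunables; the values shown are illustrative.

{noformat}
<!-- Client-side hdfs-site.xml sketch; values are illustrative. -->
<property>
  <name>dfs.client.retry.policy.enabled</name>
  <value>true</value>
</property>
<property>
  <name>dfs.client.retry.policy.spec</name>
  <!-- illustrative: retry 6 times at 10s intervals, then 10 times at 60s -->
  <value>10000,6,60000,10</value>
</property>
{noformat}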


Summary: select the appropriate DFS client retry policy; ideally your tests should use the
same ones you recommend for production.

> Restarting HDFS caused scan to fail 
> ------------------------------------
>
>                 Key: ACCUMULO-3914
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3914
>             Project: Accumulo
>          Issue Type: Bug
>            Reporter: Keith Turner
>            Assignee: Eric Newton
>             Fix For: 1.8.0
>
>         Attachments: ACCUMULO-3914-001.patch
>
>
> I was running random walk to test 1.6.3 RC1. I had an incorrect hdfs config. I changed the hdfs config and restarted hdfs while the test was running. I would not have expected this to cause problems, but it caused scans to fail.
> Below are client logs from RW.
> {noformat}
> 23 14:37:36,547 [randomwalk.Framework] ERROR: Error during random walk
> java.lang.Exception: Error running node Conditional.xml
>         at org.apache.accumulo.test.randomwalk.Module.visit(Module.java:344)
>         at org.apache.accumulo.test.randomwalk.Framework.run(Framework.java:63)
>         at org.apache.accumulo.test.randomwalk.Framework.main(Framework.java:122)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at org.apache.accumulo.start.Main$1.run(Main.java:141)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.Exception: Error running node ct.Transfer
>         at org.apache.accumulo.test.randomwalk.Module.visit(Module.java:344)
>         at org.apache.accumulo.test.randomwalk.Module$1.call(Module.java:281)
>         at org.apache.accumulo.test.randomwalk.Module$1.call(Module.java:276)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
>         at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
>         ... 1 more
> Caused by: java.lang.RuntimeException: org.apache.accumulo.core.client.impl.AccumuloServerException: Error on server worker0:9997
>         at org.apache.accumulo.core.client.impl.ScannerIterator.hasNext(ScannerIterator.java:187)
>         at org.apache.accumulo.core.client.IsolatedScanner$RowBufferingIterator.readRow(IsolatedScanner.java:69)
>         at org.apache.accumulo.core.client.IsolatedScanner$RowBufferingIterator.<init>(IsolatedScanner.java:148)
>         at org.apache.accumulo.core.client.IsolatedScanner.iterator(IsolatedScanner.java:236)
>         at org.apache.accumulo.test.randomwalk.conditional.Transfer.visit(Transfer.java:91)
>         ... 10 more
> Caused by: org.apache.accumulo.core.client.impl.AccumuloServerException: Error on server worker0:9997
>         at org.apache.accumulo.core.client.impl.ThriftScanner.scan(ThriftScanner.java:287)
>         at org.apache.accumulo.core.client.impl.ScannerIterator$Reader.run(ScannerIterator.java:84)
>         at org.apache.accumulo.core.client.impl.ScannerIterator.hasNext(ScannerIterator.java:177)
>         ... 14 more
> Caused by: org.apache.thrift.TApplicationException: Internal error processing startScan
>         at org.apache.thrift.TApplicationException.read(TApplicationException.java:111)
>         at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:71)
>         at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.recv_startScan(TabletClientService.java:228)
>         at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.startScan(TabletClientService.java:204)
>         at org.apache.accumulo.core.client.impl.ThriftScanner.scan(ThriftScanner.java:403)
>         at org.apache.accumulo.core.client.impl.ThriftScanner.scan(ThriftScanner.java:279)
>         ... 16 more
> {noformat}
> Below are logs from the tserver.
> {noformat}
> 2015-06-23 14:37:36,553 [thrift.ProcessFunction] ERROR: Internal error processing startScan
> org.apache.thrift.TException: java.util.concurrent.ExecutionException: java.net.ConnectException: Call From worker0/10.1.5.184 to leader2:10000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
>         at org.apache.accumulo.server.util.RpcWrapper$1.invoke(RpcWrapper.java:51)
>         at com.sun.proxy.$Proxy17.startScan(Unknown Source)
>         at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$startScan.getResult(TabletClientService.java:2179)
>         at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$startScan.getResult(TabletClientService.java:2163)
>         at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
>         at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
>         at org.apache.accumulo.server.util.TServerUtils$TimedProcessor.process(TServerUtils.java:168)
>         at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:516)
>         at org.apache.accumulo.server.util.CustomNonBlockingServer$1.run(CustomNonBlockingServer.java:77)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
>         at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
>         at java.lang.Thread.run(Thread.java:745)
> 2015-06-23 14:37:36,556 [tserver.TabletServer] WARN : exception while scanning tablet b;b174<
> java.net.ConnectException: Call From worker0/10.1.5.184 to leader2:10000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>         at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1472)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1399)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>         at com.sun.proxy.$Proxy15.getBlockLocations(Unknown Source)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:254)
>         at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>         at com.sun.proxy.$Proxy16.getBlockLocations(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1220)
>         at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1210)
>         at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1200)
>         at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:271)
>         at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:238)
>         at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:231)
>         at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1498)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:302)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:298)
>         at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:298)
>         at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
>         at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getBCFile(CachableBlockFile.java:263)
>         at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.access$100(CachableBlockFile.java:144)
>         at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader$RawBlockLoader.get(CachableBlockFile.java:195)
>         at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getBlock(CachableBlockFile.java:320)
>         at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getDataBlock(CachableBlockFile.java:400)
>         at org.apache.accumulo.core.file.rfile.RFile$LocalityGroupReader.getDataBlock(RFile.java:590)
>         at org.apache.accumulo.core.file.rfile.RFile$LocalityGroupReader._seek(RFile.java:715)
>         at org.apache.accumulo.core.file.rfile.RFile$LocalityGroupReader.seek(RFile.java:607)
>         at org.apache.accumulo.core.iterators.system.LocalityGroupIterator.seek(LocalityGroupIterator.java:138)
>         at org.apache.accumulo.core.file.rfile.RFile$Reader.seek(RFile.java:980)
>         at org.apache.accumulo.core.iterators.system.SourceSwitchingIterator.readNext(SourceSwitchingIterator.java:135)
>         at org.apache.accumulo.core.iterators.system.SourceSwitchingIterator.seek(SourceSwitchingIterator.java:182)
>         at org.apache.accumulo.server.problems.ProblemReportingIterator.seek(ProblemReportingIterator.java:94)
>         at org.apache.accumulo.core.iterators.system.MultiIterator.seek(MultiIterator.java:105)
>         at org.apache.accumulo.core.iterators.WrappingIterator.seek(WrappingIterator.java:101)
>         at org.apache.accumulo.core.iterators.system.StatsIterator.seek(StatsIterator.java:64)
>         at org.apache.accumulo.core.iterators.WrappingIterator.seek(WrappingIterator.java:101)
>         at org.apache.accumulo.core.iterators.system.DeletingIterator.seek(DeletingIterator.java:67)
>         at org.apache.accumulo.core.iterators.WrappingIterator.seek(WrappingIterator.java:101)
>         at org.apache.accumulo.core.iterators.SkippingIterator.seek(SkippingIterator.java:42)
>         at org.apache.accumulo.core.iterators.system.ColumnFamilySkippingIterator.seek(ColumnFamilySkippingIterator.java:123)
>         at org.apache.accumulo.core.iterators.WrappingIterator.seek(WrappingIterator.java:101)
>         at org.apache.accumulo.core.iterators.Filter.seek(Filter.java:64)
>         at org.apache.accumulo.core.iterators.WrappingIterator.seek(WrappingIterator.java:101)
>         at org.apache.accumulo.core.iterators.Filter.seek(Filter.java:64)
>         at org.apache.accumulo.core.iterators.system.SynchronizedIterator.seek(SynchronizedIterator.java:56)
>         at org.apache.accumulo.core.iterators.WrappingIterator.seek(WrappingIterator.java:101)
>         at org.apache.accumulo.core.iterators.user.VersioningIterator.seek(VersioningIterator.java:81)
>         at org.apache.accumulo.core.iterators.system.SourceSwitchingIterator.readNext(SourceSwitchingIterator.java:135)
>         at org.apache.accumulo.core.iterators.system.SourceSwitchingIterator.seek(SourceSwitchingIterator.java:182)
>         at org.apache.accumulo.tserver.Tablet.nextBatch(Tablet.java:1664)
>         at org.apache.accumulo.tserver.Tablet.access$3200(Tablet.java:174)
>         at org.apache.accumulo.tserver.Tablet$Scanner.read(Tablet.java:1804)
>         at org.apache.accumulo.tserver.TabletServer$ThriftClientHandler$NextBatchTask.run(TabletServer.java:1081)
>         at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
>         at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.net.ConnectException: Connection refused
>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
>         at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
>         at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607)
>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705)
>         at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1438)
>         ... 62 more
> {noformat}


