apex-dev mailing list archives

From "McCullough, Alex" <Alex.McCullo...@capitalone.com>
Subject Root Cause of App Failed Status
Date Wed, 11 May 2016 14:05:57 GMT
Hey Everyone,

I have an application that is “failing” after running for a number of hours, and I was wondering
whether there is a standard way to determine the cause of the failure.

In the STRAM events I see some final exceptions on containers, related to loss of socket
ownership, but when I click on the operator and look at its logs, the last thing logged is a
different error; both are listed below.

In the app master logs I see yet another error.

Is there a best practice for determining why an application ends up in the “Failed” state? And
does anyone have insight into the exceptions below?
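To make the question more concrete: is the expected approach just to read the final state and
diagnostics from the YARN application report, along the lines of the sketch below? (This is only
the stock YARN client API, nothing Apex-specific, and the application id is a placeholder, not my
real one.)

    // Sketch only: fetch final state and diagnostics for an application
    // via the standard YARN client API.
    import org.apache.hadoop.yarn.api.records.ApplicationId;
    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;
    import org.apache.hadoop.yarn.util.ConverterUtils;

    public class AppDiagnostics {
      public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();
        try {
          // Placeholder application id; substitute the real one.
          ApplicationId appId = ConverterUtils.toApplicationId("application_0000000000000_0001");
          ApplicationReport report = yarnClient.getApplicationReport(appId);
          System.out.println("YARN state:   " + report.getYarnApplicationState());
          System.out.println("Final status: " + report.getFinalApplicationStatus());
          System.out.println("Diagnostics:  " + report.getDiagnostics());
        } finally {
          yarnClient.stop();
        }
      }
    }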

Thanks,
Alex


Final lines of the app master log:

2016-05-10 20:20:33,862 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor
exception for block BP-88483743-10.24.28.46-1443641081815:blk_1167862818_94520797
java.io.IOException: Bad response ERROR for block BP-88483743-10.24.28.46-1443641081815:blk_1167862818_94520797
from datanode DatanodeInfoWithStorage[10.24.28.58:50010,DS-d0329c7e-59b4-4c6b-b321-59f8c013f113,DISK]
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:1002)
2016-05-10 20:20:33,862 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block BP-88483743-10.24.28.46-1443641081815:blk_1167862818_94520797
in pipeline DatanodeInfoWithStorage[10.24.28.56:50010,DS-3664dd2d-8bf2-402a-badb-2016bce2c642,DISK],
DatanodeInfoWithStorage[10.24.28.63:50010,DS-6c2824a3-a9f1-4cef-b3f2-4069e3a596e7,DISK], DatanodeInfoWithStorage[10.24.28.58:50010,DS-d0329c7e-59b4-4c6b-b321-59f8c013f113,DISK]:
bad datanode DatanodeInfoWithStorage[10.24.28.58:50010,DS-d0329c7e-59b4-4c6b-b321-59f8c013f113,DISK]
2016-05-10 20:20:37,646 ERROR org.apache.hadoop.hdfs.DFSClient: Failed to close inode 96057235
java.io.EOFException: Premature EOF: no length prefix available
at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2241)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1264)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1234)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1375)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1119)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:622)



Error displayed for several HDHT operators in STRAM Events:

Stopped running due to an exception. com.datatorrent.netlet.NetletThrowable$NetletRuntimeException:
java.lang.UnsupportedOperationException: Client does not own the socket any longer!
        at com.datatorrent.netlet.AbstractClient$1.offer(AbstractClient.java:364)
        at com.datatorrent.netlet.AbstractClient$1.offer(AbstractClient.java:354)
        at com.datatorrent.netlet.AbstractClient.send(AbstractClient.java:300)
        at com.datatorrent.netlet.AbstractLengthPrependerClient.write(AbstractLengthPrependerClient.java:236)
        at com.datatorrent.netlet.AbstractLengthPrependerClient.write(AbstractLengthPrependerClient.java:190)
        at com.datatorrent.stram.stream.BufferServerPublisher.put(BufferServerPublisher.java:135)
        at com.datatorrent.api.DefaultOutputPort.emit(DefaultOutputPort.java:51)
        at com.capitalone.vault8.citadel.operators.AbstractTimedHdhtRecordWriter$2.emit(AbstractTimedHdhtRecordWriter.java:92)
        at com.capitalone.vault8.citadel.operators.AbstractTimedHdhtRecordWriter$2.emit(AbstractTimedHdhtRecordWriter.java:89)
        at com.capitalone.vault8.citadel.operators.AbstractTimedHdhtRecordWriter.processTuple(AbstractTimedHdhtRecordWriter.java:78)
        at com.capitalone.vault8.citadel.operators.AbstractTimedHdhtRecordWriter$1.process(AbstractTimedHdhtRecordWriter.java:85)
        at com.capitalone.vault8.citadel.operators.AbstractTimedHdhtRecordWriter$1.process(AbstractTimedHdhtRecordWriter.java:82)
        at com.datatorrent.api.DefaultInputPort.put(DefaultInputPort.java:79)
        at com.datatorrent.stram.stream.BufferServerSubscriber$BufferReservoir.sweep(BufferServerSubscriber.java:265)
        at com.datatorrent.stram.engine.GenericNode.run(GenericNode.java:252)
        at com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1388)
Caused by: java.lang.UnsupportedOperationException: Client does not own the socket any longer!
        ... 16 more

Last lines in one of the stopped containers with the above exception:
2016-05-11 08:09:43,044 WARN com.datatorrent.stram.RecoverableRpcProxy: RPC failure, attempting
reconnect after 10000 ms (remaining 29498 ms)
java.lang.reflect.UndeclaredThrowableException
at com.sun.proxy.$Proxy18.processHeartbeat(Unknown Source)
at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.datatorrent.stram.RecoverableRpcProxy.invoke(RecoverableRpcProxy.java:138)
at com.sun.proxy.$Proxy18.processHeartbeat(Unknown Source)
at com.datatorrent.stram.engine.StreamingContainer.heartbeatLoop(StreamingContainer.java:693)
at com.datatorrent.stram.engine.StreamingContainer.main(StreamingContainer.java:312)
Caused by: java.io.EOFException: End of File Exception between local host is: "mdcilabpdn04.kdc.capitalone.com/10.24.28.53";
destination host is: "mdcilabpdn06.kdc.capitalone.com":49859; : java.io.EOFException; For
more details see: http://wiki.apache.org/hadoop/EOFException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
at org.apache.hadoop.ipc.Client.call(Client.java:1476)
at org.apache.hadoop.ipc.Client.call(Client.java:1403)
at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:243)
... 8 more
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1075)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:970)