apex-dev mailing list archives

From Munagala Ramanath <...@datatorrent.com>
Subject Re: Root Cause of App Failed Status
Date Wed, 11 May 2016 15:07:25 GMT
Alex, what version of the platform are you running?
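
The app master (STRAM) log and the container logs together usually tell the story, so it helps to look at everything in one place. A rough sketch of how to pull that, assuming YARN log aggregation is enabled on your cluster (the application id below is only a placeholder):

  # Final state and diagnostics as YARN recorded them (substitute your own application id)
  yarn application -status application_1462000000000_0001

  # Aggregated logs for all containers, including the app master, once the app has finished
  yarn logs -applicationId application_1462000000000_0001 > app.log

  # Look for the first container reported as failed or killed; later exceptions
  # (such as the "Client does not own the socket" ones) are often secondary fallout
  grep -n -i -E "exit|killed|failed" app.log | head

Between the STRAM events you already checked and the diagnostics from the -status output, it is usually possible to see which container died first and why.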

Ram

On Wed, May 11, 2016 at 7:05 AM, McCullough, Alex <
Alex.McCullough@capitalone.com> wrote:

> Hey Everyone,
>
> I have an application that is “failing” after running for a number of
> hours, and I was wondering if there is a standard way to determine the
> cause of the failure.
>
> In the STRAM events I see some final exceptions on the containers at the
> end, related to loss of socket ownership. When I click on the operator and
> look at its logs, the last thing logged is a different error; both are
> listed below.
>
> In the app master logs I see yet another error.
>
> Is there a best practice for determining why an application ends up in the
> “Failed” state? And does anyone have insight into the exceptions below?
>
> Thanks,
> Alex
>
>
> App Master Log Final Lines
>
> 2016-05-10 20:20:33,862 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-88483743-10.24.28.46-1443641081815:blk_1167862818_94520797
> java.io.IOException: Bad response ERROR for block BP-88483743-10.24.28.46-1443641081815:blk_1167862818_94520797 from datanode DatanodeInfoWithStorage[10.24.28.58:50010,DS-d0329c7e-59b4-4c6b-b321-59f8c013f113,DISK]
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:1002)
> 2016-05-10 20:20:33,862 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block BP-88483743-10.24.28.46-1443641081815:blk_1167862818_94520797 in pipeline DatanodeInfoWithStorage[10.24.28.56:50010,DS-3664dd2d-8bf2-402a-badb-2016bce2c642,DISK], DatanodeInfoWithStorage[10.24.28.63:50010,DS-6c2824a3-a9f1-4cef-b3f2-4069e3a596e7,DISK], DatanodeInfoWithStorage[10.24.28.58:50010,DS-d0329c7e-59b4-4c6b-b321-59f8c013f113,DISK]: bad datanode DatanodeInfoWithStorage[10.24.28.58:50010,DS-d0329c7e-59b4-4c6b-b321-59f8c013f113,DISK]
> 2016-05-10 20:20:37,646 ERROR org.apache.hadoop.hdfs.DFSClient: Failed to close inode 96057235
> java.io.EOFException: Premature EOF: no length prefix available
>         at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2241)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1264)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1234)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1375)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1119)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:622)
>
>
>
> Error displayed for several HDHT operators in STRAM Events:
>
> Stopped running due to an exception.
> com.datatorrent.netlet.NetletThrowable$NetletRuntimeException: java.lang.UnsupportedOperationException: Client does not own the socket any longer!
>         at com.datatorrent.netlet.AbstractClient$1.offer(AbstractClient.java:364)
>         at com.datatorrent.netlet.AbstractClient$1.offer(AbstractClient.java:354)
>         at com.datatorrent.netlet.AbstractClient.send(AbstractClient.java:300)
>         at com.datatorrent.netlet.AbstractLengthPrependerClient.write(AbstractLengthPrependerClient.java:236)
>         at com.datatorrent.netlet.AbstractLengthPrependerClient.write(AbstractLengthPrependerClient.java:190)
>         at com.datatorrent.stram.stream.BufferServerPublisher.put(BufferServerPublisher.java:135)
>         at com.datatorrent.api.DefaultOutputPort.emit(DefaultOutputPort.java:51)
>         at com.capitalone.vault8.citadel.operators.AbstractTimedHdhtRecordWriter$2.emit(AbstractTimedHdhtRecordWriter.java:92)
>         at com.capitalone.vault8.citadel.operators.AbstractTimedHdhtRecordWriter$2.emit(AbstractTimedHdhtRecordWriter.java:89)
>         at com.capitalone.vault8.citadel.operators.AbstractTimedHdhtRecordWriter.processTuple(AbstractTimedHdhtRecordWriter.java:78)
>         at com.capitalone.vault8.citadel.operators.AbstractTimedHdhtRecordWriter$1.process(AbstractTimedHdhtRecordWriter.java:85)
>         at com.capitalone.vault8.citadel.operators.AbstractTimedHdhtRecordWriter$1.process(AbstractTimedHdhtRecordWriter.java:82)
>         at com.datatorrent.api.DefaultInputPort.put(DefaultInputPort.java:79)
>         at com.datatorrent.stram.stream.BufferServerSubscriber$BufferReservoir.sweep(BufferServerSubscriber.java:265)
>         at com.datatorrent.stram.engine.GenericNode.run(GenericNode.java:252)
>         at com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1388)
> Caused by: java.lang.UnsupportedOperationException: Client does not own the socket any longer!
>         ... 16 more
>
> Last lines in one of the stopped containers with above exception:
> 2016-05-11 08:09:43,044 WARN com.datatorrent.stram.RecoverableRpcProxy: RPC failure, attempting reconnect after 10000 ms (remaining 29498 ms)
> java.lang.reflect.UndeclaredThrowableException
>         at com.sun.proxy.$Proxy18.processHeartbeat(Unknown Source)
>         at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at com.datatorrent.stram.RecoverableRpcProxy.invoke(RecoverableRpcProxy.java:138)
>         at com.sun.proxy.$Proxy18.processHeartbeat(Unknown Source)
>         at com.datatorrent.stram.engine.StreamingContainer.heartbeatLoop(StreamingContainer.java:693)
>         at com.datatorrent.stram.engine.StreamingContainer.main(StreamingContainer.java:312)
> Caused by: java.io.EOFException: End of File Exception between local host is: "mdcilabpdn04.kdc.capitalone.com/10.24.28.53"; destination host is: "mdcilabpdn06.kdc.capitalone.com":49859; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>         at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1476)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1403)
>         at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:243)
>         ... 8 more
> Caused by: java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>         at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1075)
>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:970)
>
