reef-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dhruv Mahajan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (REEF-1291) Driver gets FailedEvaluator message even if evaluator shuts down properly
Date Wed, 30 Mar 2016 21:25:25 GMT

    [ https://issues.apache.org/jira/browse/REEF-1291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15218891#comment-15218891
] 

Dhruv Mahajan commented on REEF-1291:
-------------------------------------

This is the java log...On C# side I get this containerId in IFailedEvaluator handler.

> Driver gets FailedEvaluator message even if evaluator shuts down properly
> -------------------------------------------------------------------------
>
>                 Key: REEF-1291
>                 URL: https://issues.apache.org/jira/browse/REEF-1291
>             Project: REEF
>          Issue Type: Bug
>          Components: REEF.NET
>         Environment: C#
>            Reporter: Dhruv Mahajan
>
> It seems that there is some sort of race condition when evaluator is completed....The
evaluator says that it is done and exits while driver gets the same message as if RM has taken
the container away. 
> To be more concrete, end of evaluator log and relevant driver log are shown below. One
can clearly see that shut down appeared to be clean but driver got failed evaluator message.The
driver log shows that stream on which it was reading message from evaluator was closed probably
because evaluator closed it. A simple fix would be that if driver expects this evaluator to
shutdown and it gets this message then it assumes that shutdown is complete.
> *End of Evaluator Log*
> INFO: Handle Evaluator control message
> Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2016-03-28T14:46:59.4207980-07:00
0085
> INFO: Send task control message to ContextManager
> Org.Apache.REEF.Common.Runtime.Evaluator.Context.ContextManager Information: 0 : 2016-03-28T14:46:59.4207980-07:00
0085
> INFO: RemoveContext with id DataLoading-container_e12_1458013582010_0048_01_000750
> Org.Apache.REEF.Common.Runtime.Evaluator.Context.ContextRuntime Warning: 0 : 2016-03-28T14:46:59.4207980-07:00
0085
> WARNING: Shutting down an task because the underlying context is being closed.
> Org.Apache.REEF.Common.Runtime.Evaluator.Task.TaskRuntime Information: 0 : 2016-03-28T14:46:59.4207980-07:00
0085
> INFO: Trying to close Task IMRUMap-Id297-Version0
> Org.Apache.REEF.Common.Runtime.Evaluator.Task.TaskRuntime Warning: 0 : 2016-03-28T14:46:59.4207980-07:00
0085
> WARNING: Trying to close an task that is in Done state. Ignored.
> Org.Apache.REEF.Common.Context.Defaults.DefaultContextStopHandler Information: 0 : 2016-03-28T14:46:59.4207980-07:00
0085
> INFO: DefaultContextStopHandler received for context: DataLoading-container_e12_1458013582010_0048_01_000750
> Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2016-03-28T14:46:59.4207980-07:00
0085
> INFO: Context stack is empty, done
> Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2016-03-28T14:46:59.4207980-07:00
0085
> INFO: Evaluator state : DONE
> Org.Apache.REEF.Common.Runtime.Evaluator.HeartBeatManager Information: 0 : 2016-03-28T14:46:59.4207980-07:00
0085
> INFO: Triggered a heartbeat: EvaluatorHeartbeatProto: task_id=[], task_status=[], task_message=[],
evaluator_status=[DONE], context_status=[], timestamp=[1459201619420], recoveryFlag =[False].
> Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2016-03-28T14:46:59.4207980-07:00
0001
> INFO: Runtime stop
> Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2016-03-28T14:46:59.4364238-07:00
0001
> INFO: EvaluatorRuntime shutdown complete
> Org.Apache.REEF.Evaluator.Evaluator Information: 0 : 2016-03-28T14:46:59.4364238-07:00
0001
> INFO: Evaluator is returned from Run()
> End of LogType:evaluator.stdout
> *Relevant driver log*
> INFO: Received a JobStatus message that can't be sent:
> identifier: "BroadcastReduce"
> state: RUNNING
> message: "Evaluator container_e12_1458013582010_0048_01_000750 failed with message: Evaluator
[container_e12_1458013582010_0048_01_000750] is assumed to be in state [RUNNING]. But the
resource manager reports it to be in state [DONE]. This means that the Evaluator failed but
wasn\'t able to send an error message back to the driver."
> Mar 28, 2016 2:47:00 PM org.apache.reef.wake.remote.transport.netty.NettyChannelHandler
exceptionCaught
> INFO: Unexpected exception from downstream. channel: [id: 0x51bdfbfb, /10.200.144.38:64266
=> /10.200.144.168:8959] local: /10.200.144.168:8959 remote: /10.200.144.38:64266
> Mar 28, 2016 2:47:00 PM org.apache.reef.wake.remote.transport.netty.AbstractNettyEventListener
exceptionCaught
> WARNING: ExceptionEvent: local: /10.200.144.168:8959 remote: /10.200.144.247:53406 ::
java.io.IOException: An existing connection was forcibly closed by the remote host
> Mar 28, 2016 2:47:00 PM org.apache.reef.javabridge.generic.JobDriver$CompletedEvaluatorHandler
onNext
> INFO: Handling the event of completed evaluator in CLR bridge.
> Mar 28, 2016 2:47:00 PM org.apache.reef.wake.remote.transport.netty.NettyChannelHandler
exceptionCaught
> INFO: Unexpected exception from downstream. channel: [id: 0xdc25e349, /10.200.145.206:64394
=> /10.200.144.168:8959] local: /10.200.144.168:8959 remote: /10.200.145.206:64394
> Mar 28, 2016 2:47:00 PM org.apache.reef.wake.remote.transport.netty.NettyChannelHandler
exceptionCaught
> WARNING: Unexpected exception from downstream.
> java.io.IOException: An existing connection was forcibly closed by the remote host
> at sun.nio.ch.SocketDispatcher.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:43)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> at sun.nio.ch.IOUtil.read(IOUtil.java:192)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
> at io.netty.buffer.UnpooledUnsafeDirectByteBuf.setBytes(UnpooledUnsafeDirectByteBuf.java:446)
> at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:879)
> at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225)
> at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:114)
> at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
> at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
> at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
> at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
> at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message