reef-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dhruv Mahajan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (REEF-1224) IMRU Fault Tolerance - Separate Data downloading from Task injection
Date Mon, 28 Mar 2016 23:20:25 GMT

    [ https://issues.apache.org/jira/browse/REEF-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15215077#comment-15215077
] 

Dhruv Mahajan commented on REEF-1224:
-------------------------------------

To be more concrete, I am writing end of evaluator log and relevant driver log. You can clearly
see that shut down appeared to be clean but driver got failed evaluator message.The driver
log shows that stream on which it was reading message from evaluator was closed probably because
evaluator closed it. A simple fix would be that if driver expects this evaluator to shutdown
and it gets this message then it assumes that shutdown is complete.

*End of Evaluator Log*

INFO: Handle Evaluator control message
Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2016-03-28T14:46:59.4207980-07:00
0085
INFO: Send task control message to ContextManager
Org.Apache.REEF.Common.Runtime.Evaluator.Context.ContextManager Information: 0 : 2016-03-28T14:46:59.4207980-07:00
0085
INFO: RemoveContext with id DataLoading-container_e12_1458013582010_0048_01_000750
Org.Apache.REEF.Common.Runtime.Evaluator.Context.ContextRuntime Warning: 0 : 2016-03-28T14:46:59.4207980-07:00
0085
WARNING: Shutting down an task because the underlying context is being closed.
Org.Apache.REEF.Common.Runtime.Evaluator.Task.TaskRuntime Information: 0 : 2016-03-28T14:46:59.4207980-07:00
0085
INFO: Trying to close Task IMRUMap-Id297-Version0
Org.Apache.REEF.Common.Runtime.Evaluator.Task.TaskRuntime Warning: 0 : 2016-03-28T14:46:59.4207980-07:00
0085
WARNING: Trying to close an task that is in Done state. Ignored.
Org.Apache.REEF.Common.Context.Defaults.DefaultContextStopHandler Information: 0 : 2016-03-28T14:46:59.4207980-07:00
0085
INFO: DefaultContextStopHandler received for context: DataLoading-container_e12_1458013582010_0048_01_000750
Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2016-03-28T14:46:59.4207980-07:00
0085
INFO: Context stack is empty, done
Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2016-03-28T14:46:59.4207980-07:00
0085
INFO: Evaluator state : DONE
Org.Apache.REEF.Common.Runtime.Evaluator.HeartBeatManager Information: 0 : 2016-03-28T14:46:59.4207980-07:00
0085
INFO: Triggered a heartbeat: EvaluatorHeartbeatProto: task_id=[], task_status=[], task_message=[],
evaluator_status=[DONE], context_status=[], timestamp=[1459201619420], recoveryFlag =[False].
Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2016-03-28T14:46:59.4207980-07:00
0001
INFO: Runtime stop
Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2016-03-28T14:46:59.4364238-07:00
0001
INFO: EvaluatorRuntime shutdown complete
Org.Apache.REEF.Evaluator.Evaluator Information: 0 : 2016-03-28T14:46:59.4364238-07:00 0001
INFO: Evaluator is returned from Run()
End of LogType:evaluator.stdout

*Relevant driver log*

INFO: Received a JobStatus message that can't be sent:
identifier: "BroadcastReduce"
state: RUNNING
message: "Evaluator container_e12_1458013582010_0048_01_000750 failed with message: Evaluator
[container_e12_1458013582010_0048_01_000750] is assumed to be in state [RUNNING]. But the
resource manager reports it to be in state [DONE]. This means that the Evaluator failed but
wasn\'t able to send an error message back to the driver."

Mar 28, 2016 2:47:00 PM org.apache.reef.wake.remote.transport.netty.NettyChannelHandler exceptionCaught
INFO: Unexpected exception from downstream. channel: [id: 0x51bdfbfb, /10.200.144.38:64266
=> /10.200.144.168:8959] local: /10.200.144.168:8959 remote: /10.200.144.38:64266
Mar 28, 2016 2:47:00 PM org.apache.reef.wake.remote.transport.netty.AbstractNettyEventListener
exceptionCaught
WARNING: ExceptionEvent: local: /10.200.144.168:8959 remote: /10.200.144.247:53406 :: java.io.IOException:
An existing connection was forcibly closed by the remote host
Mar 28, 2016 2:47:00 PM org.apache.reef.javabridge.generic.JobDriver$CompletedEvaluatorHandler
onNext
INFO: Handling the event of completed evaluator in CLR bridge.
Mar 28, 2016 2:47:00 PM org.apache.reef.wake.remote.transport.netty.NettyChannelHandler exceptionCaught
INFO: Unexpected exception from downstream. channel: [id: 0xdc25e349, /10.200.145.206:64394
=> /10.200.144.168:8959] local: /10.200.144.168:8959 remote: /10.200.145.206:64394
Mar 28, 2016 2:47:00 PM org.apache.reef.wake.remote.transport.netty.NettyChannelHandler exceptionCaught
WARNING: Unexpected exception from downstream.
java.io.IOException: An existing connection was forcibly closed by the remote host
	at sun.nio.ch.SocketDispatcher.read0(Native Method)
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:43)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
	at sun.nio.ch.IOUtil.read(IOUtil.java:192)
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
	at io.netty.buffer.UnpooledUnsafeDirectByteBuf.setBytes(UnpooledUnsafeDirectByteBuf.java:446)
	at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:879)
	at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:114)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
	at java.lang.Thread.run(Thread.java:745)


> IMRU Fault Tolerance - Separate Data downloading from Task injection
> --------------------------------------------------------------------
>
>                 Key: REEF-1224
>                 URL: https://issues.apache.org/jira/browse/REEF-1224
>             Project: REEF
>          Issue Type: Improvement
>          Components: IMRU, REEF.NET
>            Reporter: Julia
>            Assignee: Dhruv Mahajan
>
> Currently in IMRU, data downloading happens during the Task injection. It couples the
data and Task object. In Fault tolerant case, we would like to only resubmit a task but use
the data that have been downloaded, That requires us to decouple those two portions. For example,
data downloading portion can be attached to Context, and we can then resubmit a task on the
same context. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message