Return-Path: X-Original-To: apmail-reef-dev-archive@minotaur.apache.org Delivered-To: apmail-reef-dev-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id C7710185B2 for ; Mon, 28 Mar 2016 23:20:25 +0000 (UTC) Received: (qmail 27466 invoked by uid 500); 28 Mar 2016 23:20:25 -0000 Delivered-To: apmail-reef-dev-archive@reef.apache.org Received: (qmail 27430 invoked by uid 500); 28 Mar 2016 23:20:25 -0000 Mailing-List: contact dev-help@reef.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@reef.apache.org Delivered-To: mailing list dev@reef.apache.org Received: (qmail 27329 invoked by uid 99); 28 Mar 2016 23:20:25 -0000 Received: from arcas.apache.org (HELO arcas) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 28 Mar 2016 23:20:25 +0000 Received: from arcas.apache.org (localhost [127.0.0.1]) by arcas (Postfix) with ESMTP id 8481B2C1F5A for ; Mon, 28 Mar 2016 23:20:25 +0000 (UTC) Date: Mon, 28 Mar 2016 23:20:25 +0000 (UTC) From: "Dhruv Mahajan (JIRA)" To: dev@reef.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (REEF-1224) IMRU Fault Tolerance - Separate Data downloading from Task injection MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/REEF-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15215077#comment-15215077 ] Dhruv Mahajan commented on REEF-1224: ------------------------------------- To be more concrete, I am writing end of evaluator log and relevant driver log. You can clearly see that shut down appeared to be clean but driver got failed evaluator message.The driver log shows that stream on which it was reading message from evaluator was closed probably because evaluator closed it. A simple fix would be that if driver expects this evaluator to shutdown and it gets this message then it assumes that shutdown is complete. *End of Evaluator Log* INFO: Handle Evaluator control message Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2016-03-28T14:46:59.4207980-07:00 0085 INFO: Send task control message to ContextManager Org.Apache.REEF.Common.Runtime.Evaluator.Context.ContextManager Information: 0 : 2016-03-28T14:46:59.4207980-07:00 0085 INFO: RemoveContext with id DataLoading-container_e12_1458013582010_0048_01_000750 Org.Apache.REEF.Common.Runtime.Evaluator.Context.ContextRuntime Warning: 0 : 2016-03-28T14:46:59.4207980-07:00 0085 WARNING: Shutting down an task because the underlying context is being closed. Org.Apache.REEF.Common.Runtime.Evaluator.Task.TaskRuntime Information: 0 : 2016-03-28T14:46:59.4207980-07:00 0085 INFO: Trying to close Task IMRUMap-Id297-Version0 Org.Apache.REEF.Common.Runtime.Evaluator.Task.TaskRuntime Warning: 0 : 2016-03-28T14:46:59.4207980-07:00 0085 WARNING: Trying to close an task that is in Done state. Ignored. Org.Apache.REEF.Common.Context.Defaults.DefaultContextStopHandler Information: 0 : 2016-03-28T14:46:59.4207980-07:00 0085 INFO: DefaultContextStopHandler received for context: DataLoading-container_e12_1458013582010_0048_01_000750 Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2016-03-28T14:46:59.4207980-07:00 0085 INFO: Context stack is empty, done Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2016-03-28T14:46:59.4207980-07:00 0085 INFO: Evaluator state : DONE Org.Apache.REEF.Common.Runtime.Evaluator.HeartBeatManager Information: 0 : 2016-03-28T14:46:59.4207980-07:00 0085 INFO: Triggered a heartbeat: EvaluatorHeartbeatProto: task_id=[], task_status=[], task_message=[], evaluator_status=[DONE], context_status=[], timestamp=[1459201619420], recoveryFlag =[False]. Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2016-03-28T14:46:59.4207980-07:00 0001 INFO: Runtime stop Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2016-03-28T14:46:59.4364238-07:00 0001 INFO: EvaluatorRuntime shutdown complete Org.Apache.REEF.Evaluator.Evaluator Information: 0 : 2016-03-28T14:46:59.4364238-07:00 0001 INFO: Evaluator is returned from Run() End of LogType:evaluator.stdout *Relevant driver log* INFO: Received a JobStatus message that can't be sent: identifier: "BroadcastReduce" state: RUNNING message: "Evaluator container_e12_1458013582010_0048_01_000750 failed with message: Evaluator [container_e12_1458013582010_0048_01_000750] is assumed to be in state [RUNNING]. But the resource manager reports it to be in state [DONE]. This means that the Evaluator failed but wasn\'t able to send an error message back to the driver." Mar 28, 2016 2:47:00 PM org.apache.reef.wake.remote.transport.netty.NettyChannelHandler exceptionCaught INFO: Unexpected exception from downstream. channel: [id: 0x51bdfbfb, /10.200.144.38:64266 => /10.200.144.168:8959] local: /10.200.144.168:8959 remote: /10.200.144.38:64266 Mar 28, 2016 2:47:00 PM org.apache.reef.wake.remote.transport.netty.AbstractNettyEventListener exceptionCaught WARNING: ExceptionEvent: local: /10.200.144.168:8959 remote: /10.200.144.247:53406 :: java.io.IOException: An existing connection was forcibly closed by the remote host Mar 28, 2016 2:47:00 PM org.apache.reef.javabridge.generic.JobDriver$CompletedEvaluatorHandler onNext INFO: Handling the event of completed evaluator in CLR bridge. Mar 28, 2016 2:47:00 PM org.apache.reef.wake.remote.transport.netty.NettyChannelHandler exceptionCaught INFO: Unexpected exception from downstream. channel: [id: 0xdc25e349, /10.200.145.206:64394 => /10.200.144.168:8959] local: /10.200.144.168:8959 remote: /10.200.145.206:64394 Mar 28, 2016 2:47:00 PM org.apache.reef.wake.remote.transport.netty.NettyChannelHandler exceptionCaught WARNING: Unexpected exception from downstream. java.io.IOException: An existing connection was forcibly closed by the remote host at sun.nio.ch.SocketDispatcher.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:43) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) at sun.nio.ch.IOUtil.read(IOUtil.java:192) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) at io.netty.buffer.UnpooledUnsafeDirectByteBuf.setBytes(UnpooledUnsafeDirectByteBuf.java:446) at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:879) at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:114) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) at java.lang.Thread.run(Thread.java:745) > IMRU Fault Tolerance - Separate Data downloading from Task injection > -------------------------------------------------------------------- > > Key: REEF-1224 > URL: https://issues.apache.org/jira/browse/REEF-1224 > Project: REEF > Issue Type: Improvement > Components: IMRU, REEF.NET > Reporter: Julia > Assignee: Dhruv Mahajan > > Currently in IMRU, data downloading happens during the Task injection. It couples the data and Task object. In Fault tolerant case, we would like to only resubmit a task but use the data that have been downloaded, That requires us to decouple those two portions. For example, data downloading portion can be attached to Context, and we can then resubmit a task on the same context. -- This message was sent by Atlassian JIRA (v6.3.4#6332)