Return-Path: X-Original-To: apmail-reef-dev-archive@minotaur.apache.org Delivered-To: apmail-reef-dev-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 4521119971 for ; Wed, 30 Mar 2016 21:25:26 +0000 (UTC) Received: (qmail 19189 invoked by uid 500); 30 Mar 2016 21:25:26 -0000 Delivered-To: apmail-reef-dev-archive@reef.apache.org Received: (qmail 19083 invoked by uid 500); 30 Mar 2016 21:25:26 -0000 Mailing-List: contact dev-help@reef.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@reef.apache.org Delivered-To: mailing list dev@reef.apache.org Received: (qmail 19007 invoked by uid 99); 30 Mar 2016 21:25:25 -0000 Received: from arcas.apache.org (HELO arcas) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 30 Mar 2016 21:25:25 +0000 Received: from arcas.apache.org (localhost [127.0.0.1]) by arcas (Postfix) with ESMTP id DB7B42C1F5C for ; Wed, 30 Mar 2016 21:25:25 +0000 (UTC) Date: Wed, 30 Mar 2016 21:25:25 +0000 (UTC) From: "Dhruv Mahajan (JIRA)" To: dev@reef.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (REEF-1291) Driver gets FailedEvaluator message even if evaluator shuts down properly MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/REEF-1291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15218891#comment-15218891 ] Dhruv Mahajan commented on REEF-1291: ------------------------------------- This is the java log...On C# side I get this containerId in IFailedEvaluator handler. > Driver gets FailedEvaluator message even if evaluator shuts down properly > ------------------------------------------------------------------------- > > Key: REEF-1291 > URL: https://issues.apache.org/jira/browse/REEF-1291 > Project: REEF > Issue Type: Bug > Components: REEF.NET > Environment: C# > Reporter: Dhruv Mahajan > > It seems that there is some sort of race condition when evaluator is completed....The evaluator says that it is done and exits while driver gets the same message as if RM has taken the container away. > To be more concrete, end of evaluator log and relevant driver log are shown below. One can clearly see that shut down appeared to be clean but driver got failed evaluator message.The driver log shows that stream on which it was reading message from evaluator was closed probably because evaluator closed it. A simple fix would be that if driver expects this evaluator to shutdown and it gets this message then it assumes that shutdown is complete. > *End of Evaluator Log* > INFO: Handle Evaluator control message > Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2016-03-28T14:46:59.4207980-07:00 0085 > INFO: Send task control message to ContextManager > Org.Apache.REEF.Common.Runtime.Evaluator.Context.ContextManager Information: 0 : 2016-03-28T14:46:59.4207980-07:00 0085 > INFO: RemoveContext with id DataLoading-container_e12_1458013582010_0048_01_000750 > Org.Apache.REEF.Common.Runtime.Evaluator.Context.ContextRuntime Warning: 0 : 2016-03-28T14:46:59.4207980-07:00 0085 > WARNING: Shutting down an task because the underlying context is being closed. > Org.Apache.REEF.Common.Runtime.Evaluator.Task.TaskRuntime Information: 0 : 2016-03-28T14:46:59.4207980-07:00 0085 > INFO: Trying to close Task IMRUMap-Id297-Version0 > Org.Apache.REEF.Common.Runtime.Evaluator.Task.TaskRuntime Warning: 0 : 2016-03-28T14:46:59.4207980-07:00 0085 > WARNING: Trying to close an task that is in Done state. Ignored. > Org.Apache.REEF.Common.Context.Defaults.DefaultContextStopHandler Information: 0 : 2016-03-28T14:46:59.4207980-07:00 0085 > INFO: DefaultContextStopHandler received for context: DataLoading-container_e12_1458013582010_0048_01_000750 > Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2016-03-28T14:46:59.4207980-07:00 0085 > INFO: Context stack is empty, done > Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2016-03-28T14:46:59.4207980-07:00 0085 > INFO: Evaluator state : DONE > Org.Apache.REEF.Common.Runtime.Evaluator.HeartBeatManager Information: 0 : 2016-03-28T14:46:59.4207980-07:00 0085 > INFO: Triggered a heartbeat: EvaluatorHeartbeatProto: task_id=[], task_status=[], task_message=[], evaluator_status=[DONE], context_status=[], timestamp=[1459201619420], recoveryFlag =[False]. > Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2016-03-28T14:46:59.4207980-07:00 0001 > INFO: Runtime stop > Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2016-03-28T14:46:59.4364238-07:00 0001 > INFO: EvaluatorRuntime shutdown complete > Org.Apache.REEF.Evaluator.Evaluator Information: 0 : 2016-03-28T14:46:59.4364238-07:00 0001 > INFO: Evaluator is returned from Run() > End of LogType:evaluator.stdout > *Relevant driver log* > INFO: Received a JobStatus message that can't be sent: > identifier: "BroadcastReduce" > state: RUNNING > message: "Evaluator container_e12_1458013582010_0048_01_000750 failed with message: Evaluator [container_e12_1458013582010_0048_01_000750] is assumed to be in state [RUNNING]. But the resource manager reports it to be in state [DONE]. This means that the Evaluator failed but wasn\'t able to send an error message back to the driver." > Mar 28, 2016 2:47:00 PM org.apache.reef.wake.remote.transport.netty.NettyChannelHandler exceptionCaught > INFO: Unexpected exception from downstream. channel: [id: 0x51bdfbfb, /10.200.144.38:64266 => /10.200.144.168:8959] local: /10.200.144.168:8959 remote: /10.200.144.38:64266 > Mar 28, 2016 2:47:00 PM org.apache.reef.wake.remote.transport.netty.AbstractNettyEventListener exceptionCaught > WARNING: ExceptionEvent: local: /10.200.144.168:8959 remote: /10.200.144.247:53406 :: java.io.IOException: An existing connection was forcibly closed by the remote host > Mar 28, 2016 2:47:00 PM org.apache.reef.javabridge.generic.JobDriver$CompletedEvaluatorHandler onNext > INFO: Handling the event of completed evaluator in CLR bridge. > Mar 28, 2016 2:47:00 PM org.apache.reef.wake.remote.transport.netty.NettyChannelHandler exceptionCaught > INFO: Unexpected exception from downstream. channel: [id: 0xdc25e349, /10.200.145.206:64394 => /10.200.144.168:8959] local: /10.200.144.168:8959 remote: /10.200.145.206:64394 > Mar 28, 2016 2:47:00 PM org.apache.reef.wake.remote.transport.netty.NettyChannelHandler exceptionCaught > WARNING: Unexpected exception from downstream. > java.io.IOException: An existing connection was forcibly closed by the remote host > at sun.nio.ch.SocketDispatcher.read0(Native Method) > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:43) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) > at sun.nio.ch.IOUtil.read(IOUtil.java:192) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) > at io.netty.buffer.UnpooledUnsafeDirectByteBuf.setBytes(UnpooledUnsafeDirectByteBuf.java:446) > at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:879) > at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225) > at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:114) > at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332)