reef-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Markus Weimer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (REEF-1778) The IIMRUResultHandler Dispose() method isn't allowed to complete before the job finishes.
Date Mon, 17 Apr 2017 23:07:41 GMT

    [ https://issues.apache.org/jira/browse/REEF-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15971793#comment-15971793
] 

Markus Weimer commented on REEF-1778:
-------------------------------------

{quote}
Of course, we must test it for IMRU extensively after the change.
{quote}

[~juliaw] Shall I post the change, then, and you do the testing on the cluster?

> The IIMRUResultHandler Dispose() method isn't allowed to complete before the job finishes.
> ------------------------------------------------------------------------------------------
>
>                 Key: REEF-1778
>                 URL: https://issues.apache.org/jira/browse/REEF-1778
>             Project: REEF
>          Issue Type: Bug
>          Components: IMRU, REEF.NET
>         Environment: YARN. Does not affect the local runtime.
>            Reporter: Rogan Carr
>
> The IIMRUResultHandler Dispose() method isn't allowed to complete before the job finishes
when run on YARN. When running locally, this isn't an issue. I have not tested a Mesos runtime.
> My hunch is that YARN is killing the container before Dispose() finishes. If I had to
guess, I think that the driver calls Dispose() is on the UpdateTaskHost (which calls it on
the IIMRUResultHandler), and then the driver tells YARN that the entire job is done, and then
YARN kills all the containers before they have a chance to dispose.
> Whatever the case is, this looks like a race condition. I occasionally do see my Dispose()
actions successfully complete.
> -----------------------------------------
> [1] Log showing that IIMRUResultHandler.Dispose() did not finish. Here Dispose() is a
for loop that writes a number and waits a second. It is supposed to count to 100, but only
counts till 9 before the logs end.
> {noformat}
> ...
> INFO: receive a ReefMessage from Org.Apache.REEF.Common.Protobuf.ReefProtocol.REEFMessage
Driver at socket://100.119.200.14:9570.
> Org.Apache.REEF.Common.Runtime.Evaluator.ReefMessageProtoObserver Information: 0 : 2017-04-14T15:48:09.5616766+00:00
0020
> INFO: Control protobuf to remove context MasterRootContext
> Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2017-04-14T15:48:09.5616766+00:00
0020
> INFO: Handle Evaluator control message:  done_evaluator = null kill_evaluator = null
stop_evaluator = null context_control = not null
> Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2017-04-14T15:48:09.5616766+00:00
0020
> INFO: Send task control message to ContextManager
> Org.Apache.REEF.Common.Runtime.Evaluator.Context.ContextManager Information: 0 : 2017-04-14T15:48:09.5616766+00:00
0020
> INFO: RemoveContext with id MasterRootContext
> Org.Apache.REEF.Common.Runtime.Evaluator.Context.ContextRuntime Warning: 0 : 2017-04-14T15:48:09.5616766+00:00
0020
> WARNING: Shutting down an task because the underlying context is being closed.
> Org.Apache.REEF.Common.Runtime.Evaluator.Task.TaskRuntime Information: 0 : 2017-04-14T15:48:09.5616766+00:00
0020
> INFO: Trying to close Task IMRUMaster-0
> Org.Apache.REEF.Common.Runtime.Evaluator.Task.TaskRuntime Warning: 0 : 2017-04-14T15:48:09.5616766+00:00
0020
> WARNING: Trying to close an task that is in Done state. Ignored.
> Org.Apache.REEF.Common.Context.Defaults.DefaultContextStopHandler Information: 0 : 2017-04-14T15:48:09.5616766+00:00
0020
> INFO: DefaultContextStopHandler received for context: MasterRootContext
> Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2017-04-14T15:48:09.5616766+00:00
0020
> INFO: Context stack is empty, done
> Org.Apache.REEF.Common.Runtime.Evaluator.HeartBeatManager Information: 0 : 2017-04-14T15:48:09.5616766+00:00
0020
> INFO: Triggered a heartbeat: EvaluatorHeartbeatProto: task_id=[], task_status=[], task_message=[],
evaluator_status=[DONE], context_status=[], time
> stamp=[1492184889561], recoveryFlag =[False].
> Org.Apache.REEF.Common.Runtime.Evaluator.ReefMessageProtoObserver Information: 0 : 2017-04-14T15:48:09.5929225+00:00
0005
> INFO: receive a ReefMessage from Org.Apache.REEF.Common.Protobuf.ReefProtocol.REEFMessage
Driver at socket://100.119.200.14:9570.
> Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2017-04-14T15:48:09.5929225+00:00
0005
> INFO: Handle Evaluator control message:  done_evaluator = not null kill_evaluator = null
stop_evaluator = null context_control = null
> Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2017-04-14T15:48:09.5929225+00:00
0005
> INFO: Received ACK from Driver, shutting down Evaluator.
> Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorExitLogger Information: 0 : 2017-04-14T15:48:09.5929225+00:00
0001
> INFO: The Evaluator has successfully exited.
> Org.Apache.REEF.Evaluator.Evaluator Information: 0 : 2017-04-14T15:48:09.5929225+00:00
0001
> INFO: Evaluator is returned from Run()
> [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:10.0460051+00:00 0014
> INFO: IIMRUResultHandler Dispose Countdown: 1
> [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:11.0615866+00:00 0014
> INFO: IIMRUResultHandler Dispose Countdown: 2
> [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:12.0771779+00:00 0014
> INFO: IIMRUResultHandler Dispose Countdown: 3
> [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:13.0927401+00:00 0014
> INFO: IIMRUResultHandler Dispose Countdown: 4
> [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:14.1083243+00:00 0014
> INFO: IIMRUResultHandler Dispose Countdown: 5
> [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:15.1239021+00:00 0014
> INFO: IIMRUResultHandler Dispose Countdown: 6
> [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:16.1394801+00:00 0014
> INFO: IIMRUResultHandler Dispose Countdown: 7
> [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:17.1550601+00:00 0014
> INFO: IIMRUResultHandler Dispose Countdown: 8
> [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:18.1706423+00:00 0014
> INFO: IIMRUResultHandler Dispose Countdown: 9
> End of LogType:evaluator.stdout
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Mime
View raw message