reef-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Markus Weimer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (REEF-1778) The IIMRUResultHandler Dispose() method isn't allowed to complete before the job finishes.
Date Mon, 17 Apr 2017 16:33:41 GMT

    [ https://issues.apache.org/jira/browse/REEF-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15971307#comment-15971307
] 

Markus Weimer commented on REEF-1778:
-------------------------------------

Thanks for reporting this!

I believe your analysis is spot-on. The `Task.Dispose()` method is guaranteed to complete
right now. The sequence of events at the end of an IMRU Job is as follows, IIRC:

  * {{UpdateTaskHost}} determines that the job is done and {{return}}s.
  * On a Thread, {{TaskRuntime}} sends the {{ITask}}'s return value to {{TaskStatus.SetResult()}}
({{TaskRuntime.cs}} line 103) which in turn triggers a heartbeat ({{TaskStatus.cs}} line 127.

  * Receipt of this heartbeat causes the Driver to shutdown the job.
  * Meanwhile, Thread mentioned above is still busy working through the lines after line 103
in {{TaskRuntime.cs}}, which includes calling the {{Dispose}} method on the {{ITask}}, which
eventually calls the {{Dispose}} methods in IMRU Tasks.

These last two steps create a race condition. If the roundtrip to the Driver and back is faster
than the work done in {{TaskRuntime.cs}} post-line 103, the Thread could be rudely interrupted.

Does this sound like a reasonable explanation for this bug? If so, what ideas do we have to
remedy it? My first idea is to restructure {{TaskRuntime}} such that {{Dispose}} is called
before a result is send to the Driver. Any other ideas?




> The IIMRUResultHandler Dispose() method isn't allowed to complete before the job finishes.
> ------------------------------------------------------------------------------------------
>
>                 Key: REEF-1778
>                 URL: https://issues.apache.org/jira/browse/REEF-1778
>             Project: REEF
>          Issue Type: Bug
>          Components: IMRU, REEF.NET
>         Environment: YARN. Does not affect the local runtime.
>            Reporter: Rogan Carr
>
> The IIMRUResultHandler Dispose() method isn't allowed to complete before the job finishes
when run on YARN. When running locally, this isn't an issue. I have not tested a Mesos runtime.
> My hunch is that YARN is killing the container before Dispose() finishes. If I had to
guess, I think that the driver calls Dispose() is on the UpdateTaskHost (which calls it on
the IIMRUResultHandler), and then the driver tells YARN that the entire job is done, and then
YARN kills all the containers before they have a chance to dispose.
> Whatever the case is, this looks like a race condition. I occasionally do see my Dispose()
actions successfully complete.
> -----------------------------------------
> [1] Log showing that IIMRUResultHandler.Dispose() did not finish. Here Dispose() is a
for loop that writes a number and waits a second. It is supposed to count to 100, but only
counts till 9 before the logs end.
> {noformat}
> ...
> INFO: receive a ReefMessage from Org.Apache.REEF.Common.Protobuf.ReefProtocol.REEFMessage
Driver at socket://100.119.200.14:9570.
> Org.Apache.REEF.Common.Runtime.Evaluator.ReefMessageProtoObserver Information: 0 : 2017-04-14T15:48:09.5616766+00:00
0020
> INFO: Control protobuf to remove context MasterRootContext
> Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2017-04-14T15:48:09.5616766+00:00
0020
> INFO: Handle Evaluator control message:  done_evaluator = null kill_evaluator = null
stop_evaluator = null context_control = not null
> Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2017-04-14T15:48:09.5616766+00:00
0020
> INFO: Send task control message to ContextManager
> Org.Apache.REEF.Common.Runtime.Evaluator.Context.ContextManager Information: 0 : 2017-04-14T15:48:09.5616766+00:00
0020
> INFO: RemoveContext with id MasterRootContext
> Org.Apache.REEF.Common.Runtime.Evaluator.Context.ContextRuntime Warning: 0 : 2017-04-14T15:48:09.5616766+00:00
0020
> WARNING: Shutting down an task because the underlying context is being closed.
> Org.Apache.REEF.Common.Runtime.Evaluator.Task.TaskRuntime Information: 0 : 2017-04-14T15:48:09.5616766+00:00
0020
> INFO: Trying to close Task IMRUMaster-0
> Org.Apache.REEF.Common.Runtime.Evaluator.Task.TaskRuntime Warning: 0 : 2017-04-14T15:48:09.5616766+00:00
0020
> WARNING: Trying to close an task that is in Done state. Ignored.
> Org.Apache.REEF.Common.Context.Defaults.DefaultContextStopHandler Information: 0 : 2017-04-14T15:48:09.5616766+00:00
0020
> INFO: DefaultContextStopHandler received for context: MasterRootContext
> Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2017-04-14T15:48:09.5616766+00:00
0020
> INFO: Context stack is empty, done
> Org.Apache.REEF.Common.Runtime.Evaluator.HeartBeatManager Information: 0 : 2017-04-14T15:48:09.5616766+00:00
0020
> INFO: Triggered a heartbeat: EvaluatorHeartbeatProto: task_id=[], task_status=[], task_message=[],
evaluator_status=[DONE], context_status=[], time
> stamp=[1492184889561], recoveryFlag =[False].
> Org.Apache.REEF.Common.Runtime.Evaluator.ReefMessageProtoObserver Information: 0 : 2017-04-14T15:48:09.5929225+00:00
0005
> INFO: receive a ReefMessage from Org.Apache.REEF.Common.Protobuf.ReefProtocol.REEFMessage
Driver at socket://100.119.200.14:9570.
> Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2017-04-14T15:48:09.5929225+00:00
0005
> INFO: Handle Evaluator control message:  done_evaluator = not null kill_evaluator = null
stop_evaluator = null context_control = null
> Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2017-04-14T15:48:09.5929225+00:00
0005
> INFO: Received ACK from Driver, shutting down Evaluator.
> Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorExitLogger Information: 0 : 2017-04-14T15:48:09.5929225+00:00
0001
> INFO: The Evaluator has successfully exited.
> Org.Apache.REEF.Evaluator.Evaluator Information: 0 : 2017-04-14T15:48:09.5929225+00:00
0001
> INFO: Evaluator is returned from Run()
> [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:10.0460051+00:00 0014
> INFO: IIMRUResultHandler Dispose Countdown: 1
> [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:11.0615866+00:00 0014
> INFO: IIMRUResultHandler Dispose Countdown: 2
> [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:12.0771779+00:00 0014
> INFO: IIMRUResultHandler Dispose Countdown: 3
> [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:13.0927401+00:00 0014
> INFO: IIMRUResultHandler Dispose Countdown: 4
> [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:14.1083243+00:00 0014
> INFO: IIMRUResultHandler Dispose Countdown: 5
> [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:15.1239021+00:00 0014
> INFO: IIMRUResultHandler Dispose Countdown: 6
> [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:16.1394801+00:00 0014
> INFO: IIMRUResultHandler Dispose Countdown: 7
> [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:17.1550601+00:00 0014
> INFO: IIMRUResultHandler Dispose Countdown: 8
> [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:18.1706423+00:00 0014
> INFO: IIMRUResultHandler Dispose Countdown: 9
> End of LogType:evaluator.stdout
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Mime
View raw message