reef-dev mailing list archives

From "Geon-Woo Kim (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (REEF-1040) Fix a bug in WatcherTest
Date Thu, 18 Feb 2016 07:18:18 GMT

    [ https://issues.apache.org/jira/browse/REEF-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15151849#comment-15151849 ]

Geon-Woo Kim commented on REEF-1040:
------------------------------------

I've found the root cause. As we investigated in the previous [PR|https://github.com/apache/reef/pull/702],
the failure seems to come from a hidden bug in suspend-resubmit.

In {{WatcherTestDriver}}, a task is suspended immediately when the {{RunningTask}} event fires,
but the suspend request is ignored with the log at {{TaskRuntime:219}} ("Trying to
suspend a task that is in state: INIT. Ignoring."). This means the evaluator-side task state
can still be INIT even after the {{RunningTask}} event has fired in the driver. The code-level
failure scenario is as follows.

[driver side] 1. A task is submitted ({{WatcherTestDriver#TaskFailedHandler:165}}).
[evaluator side] 2. A {{TaskRuntime}} instance is instantiated and initialized ({{ContextRuntime:257}}).
It sets the task state to INIT and sends its status (task_status) via heartbeat.
[driver side] 3 - 1. The received task status has the INIT state, so {{TaskRepresenter.onTaskInit}}
is called and, as a result, the RunningTask handlers are executed ({{TaskRepresenter.onTaskInit:130}}).
{{WatcherTestDriver#TaskRunningHandler}} then requests that the task be suspended.
[evaluator side] 3 - 2. Concurrently, following {{ContextRuntime:257}} in step 2, the evaluator-side
task state is set to RUNNING at {{TaskRuntime.run:134}} and our {{WatcherTestTask}} is executed.

If the suspend request in 3 - 1 arrives before the state transition from INIT to RUNNING in
3 - 2, the suspend request is ignored, and the task, which needs the suspend event to stop,
never returns.
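
To make the ordering dependence concrete, here is a minimal, deterministic sketch of the scenario above. It is not REEF code: {{ToyTaskRuntime}}, {{State}}, and {{simulate}} are hypothetical names, and the only behavior it assumes is what the log message describes, namely that a suspend request received in INIT is dropped.

```java
public class SuspendRaceSketch {

    /** Simplified task states; the real evaluator-side state set is larger. */
    enum State { INIT, RUNNING, SUSPENDED }

    /** Hypothetical stand-in for the evaluator-side task runtime. */
    static final class ToyTaskRuntime {
        private State state = State.INIT;

        /** Models the INIT -> RUNNING transition (step 3 - 2). */
        void run() {
            if (state == State.INIT) {
                state = State.RUNNING;
            }
        }

        /** Models the suspend request from the driver (step 3 - 1). */
        boolean suspend() {
            if (state != State.RUNNING) {
                // Assumed behavior: the request is ignored, as in the quoted log line.
                return false;
            }
            state = State.SUSPENDED;
            return true;
        }

        State state() {
            return state;
        }
    }

    /** Replays both orderings of steps 3 - 1 and 3 - 2 without real threads. */
    static State simulate(boolean suspendArrivesBeforeRun) {
        ToyTaskRuntime task = new ToyTaskRuntime();
        if (suspendArrivesBeforeRun) {
            task.suspend(); // dropped: the task is still in INIT
            task.run();     // the task now runs and never sees a suspend
        } else {
            task.run();
            task.suspend(); // honored: the task is in RUNNING
        }
        return task.state();
    }

    public static void main(String[] args) {
        System.out.println("suspend first: " + simulate(true));
        System.out.println("run first: " + simulate(false));
    }
}
```

In the first ordering the toy task ends in RUNNING with no pending suspend, which corresponds to the test hanging; in the second it ends in SUSPENDED.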

The main problem is that the driver regards a task as running when the task sends an INIT
message (not a RUNNING message). This allows {{RunningTask}} handlers to be executed even when
the evaluator-side task status is not RUNNING. We can address the issue by guaranteeing that the
{{RunningTask}} event is fired only if the corresponding status in the evaluator is RUNNING.

I've created a [new PR|https://github.com/apache/reef/pull/845] that makes the {{RunningTask}}
event handlers be called when the first RUNNING message arrives, but I'm not sure it is
a proper solution. What do you think, [~bgchun], [~markus.weimer], [~MariiaMykhailova]?

Thanks.

> Fix a bug in WatcherTest
> ------------------------
>
>                 Key: REEF-1040
>                 URL: https://issues.apache.org/jira/browse/REEF-1040
>             Project: REEF
>          Issue Type: Sub-task
>          Components: REEF-IO
>            Reporter: Geon-Woo Kim
>            Priority: Blocker
>
> Watcher tests sporadically fail, especially in Travis CI. The bug should be fixed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
