reef-dev mailing list archives

From "Dhruv Mahajan (JIRA)" <>
Subject [jira] [Commented] (REEF-1224) IMRU Fault Tolerance - Separate Data downloading from Task injection
Date Mon, 28 Mar 2016 22:21:25 GMT


Dhruv Mahajan commented on REEF-1224:

One more question: I am observing what looks like a race condition when an evaluator completes.
The evaluator reports that it is done and exits, while the driver receives the same event as
if the RM had taken the container away, i.e.:

Mar 28, 2016 2:46:59 PM org.apache.reef.runtime.common.driver.evaluator.EvaluatorManager onEvaluatorException
WARNING: Failed evaluator: container_e12_1458013582010_0048_01_000327
org.apache.reef.exception.EvaluatorException: Evaluator [container_e12_1458013582010_0048_01_000327]
is assumed to be in state [RUNNING]. But the resource manager reports it to be in state [DONE].
This means that the Evaluator failed but wasn't able to send an error message back to the
	at org.apache.reef.runtime.common.driver.evaluator.EvaluatorManager.onResourceStatusMessage(
	at org.apache.reef.runtime.common.driver.resourcemanager.ResourceStatusHandler.onNext(
	at org.apache.reef.runtime.common.driver.resourcemanager.ResourceStatusHandler.onNext(
	at org.apache.reef.runtime.yarn.driver.REEFEventHandlers.onResourceStatus(
	at org.apache.reef.runtime.yarn.driver.YarnContainerManager.onContainerStatus(
	at org.apache.reef.runtime.yarn.driver.YarnContainerManager.onContainersCompleted(
	at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$

This seems like a bug, right? Note that I do not have a binding to {{ICompletedEvaluator}} in
the driver. Is that binding needed to avoid this error?

> IMRU Fault Tolerance - Separate Data downloading from Task injection
> --------------------------------------------------------------------
>                 Key: REEF-1224
>                 URL:
>             Project: REEF
>          Issue Type: Improvement
>          Components: IMRU, REEF.NET
>            Reporter: Julia
>            Assignee: Dhruv Mahajan
> Currently in IMRU, data downloading happens during Task injection, which couples the
data to the Task object. In the fault-tolerant case, we would like to resubmit only the task and
reuse the data that has already been downloaded. That requires decoupling the two. For example,
the data downloading portion can be attached to the Context, and we can then resubmit a task on the
same context.
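
The decoupling described above can be sketched in a few lines. This is only an illustration of the idea, not REEF or IMRU code: `Context`, `Task`, `get_data`, and `download_count` are hypothetical names, and the loader stands in for whatever actually downloads the data. The sketch assumes the context survives a task failure, so a resubmitted task finds the data already present.

```python
class Context:
    """Hypothetical stand-in for a context that owns the downloaded data."""

    def __init__(self, loader):
        self._loader = loader      # callable that performs the (expensive) download
        self._loaded = False
        self._data = None
        self.download_count = 0    # kept only to make the reuse observable

    def get_data(self):
        # Download at most once per context; later tasks reuse the result.
        if not self._loaded:
            self._data = self._loader()
            self._loaded = True
            self.download_count += 1
        return self._data


class Task:
    """Hypothetical task that reads its data from the context it runs on."""

    def __init__(self, context, fail=False):
        self._context = context
        self._fail = fail          # simulate a task failure when True

    def run(self):
        data = self._context.get_data()
        if self._fail:
            raise RuntimeError("simulated task failure")
        return sum(data)


if __name__ == "__main__":
    context = Context(lambda: [1, 2, 3])
    try:
        Task(context, fail=True).run()   # first attempt fails after the download
    except RuntimeError:
        pass
    result = Task(context).run()         # resubmitted task on the same context
    print(result, context.download_count)
```

Because the data lives on the context rather than on the task object, the second task does not trigger another download.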

This message was sent by Atlassian JIRA
