reef-dev mailing list archives

From "Dhruv Mahajan (JIRA)" <>
Subject [jira] [Commented] (REEF-1480) Increase the retry count for task registration to high value
Date Thu, 21 Jul 2016 00:38:20 GMT


Dhruv Mahajan commented on REEF-1480:

[~shravanmn] and I discussed this a bit offline. Summarizing for everybody.

Regarding the data download being executed in parallel: This should really be done

We can either use the {{Cache()}} function of {{IInputPartition}} and modify {{FileBasedInputPartition}}
to download asynchronously, or simply call {{GetPartitionHandle}} for the first time in the context
handler. With the second option, {{FileBasedInputPartition}} would return null on the first call and
wait on later calls. Users writing their own {{IInputPartition}} would need to be
aware of this logic. I personally prefer using {{Cache()}}, but that API is marked as unstable.
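
The two options above could be sketched roughly as follows. This is only an illustration under stated assumptions, not REEF.NET code: {{AsyncPartitionSketch}}, its string-valued handle, and the injected download delegate are hypothetical stand-ins, and the real {{IInputPartition}} signatures may differ.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for a file-based partition; not the REEF.NET class.
public sealed class AsyncPartitionSketch
{
    private readonly Func<string> _download;   // assumed: downloads data, returns a local path
    private readonly object _lock = new object();
    private Task<string> _pending;              // null until a download has been started

    public AsyncPartitionSketch(Func<string> download)
    {
        _download = download;
    }

    // Option 1: the context handler calls Cache() to start the download
    // in the background and returns immediately.
    public void Cache()
    {
        lock (_lock)
        {
            if (_pending == null)
            {
                _pending = Task.Run(_download);
            }
        }
    }

    // Option 2: if no download has been started yet, this call starts it
    // and returns null; every later call blocks until the download is done.
    public string GetPartitionHandle()
    {
        lock (_lock)
        {
            if (_pending == null)
            {
                _pending = Task.Run(_download);
                return null;
            }
        }
        return _pending.Result;
    }
}
```

In this sketch a context handler would call either {{Cache()}} or {{GetPartitionHandle()}} once and ignore the result, and the task would later call {{GetPartitionHandle()}} to block until the data is available. The null-on-first-call behavior is the part that authors of custom partitions would have to know about.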

For the group-communication set-up and registration, why do we have a wait with a timeout?
We can actually do this through a context as well, right?

Agreed, this is a good solution. One point to note, though: in the IMRU FT work (REEF-1251), we
currently assume a root context that downloads data, with the task running on top of it. Currently the
group comm. configuration is merged with the task configuration and instantiated as part of the task.
With this change we would have another context on top of the root context, and failures there would also
need to be handled, which will require a fair amount of change in the IMRU FT driver design.
Everyone working on REEF-1251 needs to be aware of this.

[~markus.weimer] [~shravanmn] thoughts on this?

Also looping in [~juliaw] [~MariiaMykhailova] [~andreym], since they are working on REEF-1251
and need to be aware of this.

> Increase the retry count for task registration to high value
> ------------------------------------------------------------
>                 Key: REEF-1480
>                 URL:
>             Project: REEF
>          Issue Type: Improvement
>          Components: REEF.NET
>         Environment: C#
>            Reporter: Dhruv Mahajan
> Currently, the default retry count in group communication for waiting on registration is
set so that an error is thrown after around 4 minutes. For IMRU tasks, if data downloading takes
a long time, this error gets thrown. In general this can be an issue for any other application
as well, since the retry count is too low-level a parameter to expose via application interfaces such as
{{IMRUJobDefinition}}. Like Hadoop MapReduce, we could take a configuration file and read
these parameters from there. For now, we would like to set the default to a very high

This message was sent by Atlassian JIRA