reef-dev mailing list archives

From "Markus Weimer (JIRA)" <>
Subject [jira] [Commented] (REEF-1480) Increase the retry count for task registration to high value
Date Tue, 05 Jul 2016 17:31:11 GMT


Markus Weimer commented on REEF-1480:

> If we expect user to always do that then may be

What do you mean? I suspect that a long delay before an error surfaces is an issue for just
about any use case. Or am I missing something?

I don't think a configuration option is the right solution here. We cannot reasonably expect
users to tune this parameter based on how fast their cluster is, how much data they need to
load and how many containers the job may need. Let's make this automatic. 

One step is to put data download and worker registration on different threads. That should
remove the need for users to reason about their per-machine data download times.
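
A minimal sketch of that separation, written in Python for brevity rather than in the C# that REEF.NET uses. The names `start_task`, `download_data` and `register` are hypothetical placeholders and not REEF APIs; the point is only that registration no longer waits behind the download.

```python
from concurrent.futures import ThreadPoolExecutor


def start_task(download_data, register):
    """Run the data download on a worker thread and register right away.

    Both arguments are hypothetical zero-argument callables standing in for
    the task's data loading and its registration with the name server.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        # The download starts in the background on a worker thread.
        pending = pool.submit(download_data)
        # Registration runs immediately on the calling thread, so its
        # latency is independent of how long the download takes.
        register()
        # Block only at the point where the data is actually needed;
        # result() re-raises any exception raised by the download.
        return pending.result()
```

With this shape, the registration retry budget only has to cover registration itself, whatever the per-machine download time turns out to be.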

Which leaves the number of containers as the second concern. What time scales with the number
of containers? Is it a matter of wiring up the topology? If so, couldn't this be done separately
from the Task registration with the name server?

> Increase the retry count for task registration to high value
> ------------------------------------------------------------
>                 Key: REEF-1480
>                 URL:
>             Project: REEF
>          Issue Type: Improvement
>          Components: REEF.NET
>         Environment: C#
>            Reporter: Dhruv Mahajan
> Currently, the default retry count that Group Communication uses while waiting for registration
> is set so that an error is thrown after around 4 minutes. For IMRU tasks, this error is thrown
> if data downloading takes a long time. In general, this can be an issue for any other application
> as well, since the parameter is too low-level to expose via application interfaces such as
> {{IMRUJobDefinition}}. Like Hadoop MapReduce, we could take a configuration file and read
> these parameters from it. For now, we would like to set the default to a very high

This message was sent by Atlassian JIRA
