flink-issues mailing list archives

From "Till Rohrmann (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-4152) TaskManager registration exponential backoff doesn't work
Date Wed, 13 Jul 2016 16:25:20 GMT

    [ https://issues.apache.org/jira/browse/FLINK-4152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375296#comment-15375296 ]

Till Rohrmann commented on FLINK-4152:

[~mxm] The restarted registration attempts are the observable symptoms caused by a different problem.

The actual problem is that the {{YarnFlinkResourceManager}} forgets about the registered
task managers if the job manager loses its leadership. Each task manager has a resource ID
with which it registers at the resource manager. The {{YarnFlinkResourceManager}} has two
states for allocated resources: {{containersInLaunch}} and {{registeredWorkers}}. A container
can only go from {{containersInLaunch}} to {{registeredWorkers}}. This also works for the
initial registration. However, when the job manager loses its leadership and the {{registeredWorkers}}
list is cleared, there is no longer a container in launch associated with the respective
resource ID. Consequently, when the old task manager is being re-registered by the new leader,
the registration is rejected.
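
To make the failure sequence concrete, here is a minimal sketch of the bookkeeping described above. The class {{WorkerBookkeeping}} and its method names are hypothetical and heavily simplified; only the two state names and {{jobManagerLostLeadership}} come from the actual code, and resource IDs are modelled as plain strings.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical, simplified model of the two states; not the actual Flink class. */
class WorkerBookkeeping {
    private final Set<String> containersInLaunch = new HashSet<>();
    private final Set<String> registeredWorkers = new HashSet<>();

    /** A container was allocated and its task manager is starting up. */
    void containerAllocated(String resourceId) {
        containersInLaunch.add(resourceId);
    }

    /** The only modelled transition: containersInLaunch -> registeredWorkers. */
    boolean register(String resourceId) {
        if (containersInLaunch.remove(resourceId)) {
            registeredWorkers.add(resourceId);
            return true;
        }
        return false; // no container in launch for this resource ID: rejected
    }

    /** Mirrors the described behaviour: registered workers are simply forgotten. */
    void jobManagerLostLeadership() {
        registeredWorkers.clear();
    }

    public static void main(String[] args) {
        WorkerBookkeeping bookkeeping = new WorkerBookkeeping();
        bookkeeping.containerAllocated("tm-1");
        System.out.println("initial registration accepted: " + bookkeeping.register("tm-1"));
        bookkeeping.jobManagerLostLeadership();
        System.out.println("re-registration accepted: " + bookkeeping.register("tm-1"));
    }
}
```

In this model the initial registration succeeds, while the re-registration after the leadership change is rejected because the resource ID is in neither set any more.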

This rejection is then sent to the task manager. Upon receiving a rejection, the task manager
reschedules another registration attempt after waiting for some time. Here the problem is
that the old registration attempts are not cancelled. Consequently, multiple registration
attempts run concurrently, each with its own backoff. That's the reason why
you observe many registration attempt messages in the log.

I think the symptom can be fixed by cancelling all currently active registration attempts
when you want to restart the registration.
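
A rough sketch of that idea, not the actual TaskManager code: keep a handle to the pending attempt and cancel it before scheduling the next one. The class {{RegistrationScheduler}} and all of its members are hypothetical, and a plain {{ScheduledExecutorService}} stands in for whatever scheduling mechanism the task manager really uses.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: at most one registration attempt is pending at any time. */
class RegistrationScheduler {
    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor();
    private final AtomicInteger attemptsSent = new AtomicInteger();
    private ScheduledFuture<?> pending; // the currently scheduled attempt, if any

    /** Cancels the pending attempt (if any) before scheduling a new one. */
    synchronized void restartRegistration(long delayMillis) {
        if (pending != null) {
            pending.cancel(false); // an attempt that has not started will never run
        }
        pending = executor.schedule(
                () -> { attemptsSent.incrementAndGet(); }, // stand-in for sending the message
                delayMillis, TimeUnit.MILLISECONDS);
    }

    int attemptsSent() {
        return attemptsSent.get();
    }

    void shutdown() {
        executor.shutdownNow();
    }

    /** Restarts registration three times in a row and reports how many attempts fired. */
    static int demo() {
        RegistrationScheduler scheduler = new RegistrationScheduler();
        try {
            scheduler.restartRegistration(200);
            scheduler.restartRegistration(200);
            scheduler.restartRegistration(50);
            Thread.sleep(500); // long enough for every non-cancelled attempt to fire
            return scheduler.attemptsSent();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        } finally {
            scheduler.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("attempts sent: " + demo());
    }
}
```

Without the {{cancel}} call, all three scheduled attempts would fire. With it, only the most recently scheduled one does.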

It is a bit unclear to me what the expected behaviour of the {{YarnFlinkResourceManager}} should
be. In the {{jobManagerLostLeadership}} method, where the {{registeredWorkers}} list is cleared,
a comment says "all currently registered TaskManagers are put under 'awaiting registration'".
But there is no such state. Furthermore, I'm not sure whether registered TaskManagers have
to re-register if only the job manager has failed.

Thus, I see two solutions. Either not clearing {{registeredWorkers}} or introducing a new
state "awaiting registration" which keeps all formerly registered task managers which can
be re-registered.
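
The second option could look roughly like the sketch below. This is only an illustration of the proposal, not existing code: the class {{BookkeepingWithAwaitingState}}, the {{awaitingRegistration}} set and the string resource IDs are all assumptions.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of the proposed third state; not the actual Flink class. */
class BookkeepingWithAwaitingState {
    private final Set<String> containersInLaunch = new HashSet<>();
    private final Set<String> awaitingRegistration = new HashSet<>(); // proposed new state
    private final Set<String> registeredWorkers = new HashSet<>();

    void containerAllocated(String resourceId) {
        containersInLaunch.add(resourceId);
    }

    /** Accepts workers that are in launch or that were registered before a leader change. */
    boolean register(String resourceId) {
        if (containersInLaunch.remove(resourceId) || awaitingRegistration.remove(resourceId)) {
            registeredWorkers.add(resourceId);
            return true;
        }
        return false; // unknown resource ID: rejected
    }

    /** Instead of forgetting registered workers, remember them for re-registration. */
    void jobManagerLostLeadership() {
        awaitingRegistration.addAll(registeredWorkers);
        registeredWorkers.clear();
    }

    public static void main(String[] args) {
        BookkeepingWithAwaitingState bookkeeping = new BookkeepingWithAwaitingState();
        bookkeeping.containerAllocated("tm-1");
        bookkeeping.register("tm-1");
        bookkeeping.jobManagerLostLeadership();
        System.out.println("re-registration accepted: " + bookkeeping.register("tm-1"));
    }
}
```

The first option (not clearing {{registeredWorkers}}) would be simpler still, but whether it is correct depends on whether task managers must re-register after a job manager failure.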

Maybe [~mxm] can give some input.

> TaskManager registration exponential backoff doesn't work
> ---------------------------------------------------------
>                 Key: FLINK-4152
>                 URL: https://issues.apache.org/jira/browse/FLINK-4152
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination, TaskManager, YARN Client
>            Reporter: Robert Metzger
>            Assignee: Till Rohrmann
>         Attachments: logs.tgz
> While testing Flink 1.1 I've found that the TaskManagers are logging many messages when registering at the JobManager.
> This is the log file: https://gist.github.com/rmetzger/0cebe0419cdef4507b1e8a42e33ef294
> It's logging more than 3000 messages in less than a minute. I don't think that this is the expected behavior.

