flink-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-4152) TaskManager registration exponential backoff doesn't work
Date Fri, 15 Jul 2016 13:55:20 GMT

    [ https://issues.apache.org/jira/browse/FLINK-4152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15379396#comment-15379396

ASF GitHub Bot commented on FLINK-4152:

GitHub user tillrohrmann opened a pull request:


    [FLINK-4152] Allow re-registration of TMs at resource manager

    The `YarnFlinkResourceManager` does not allow the `JobManager` to re-register task managers
have had been registered at the resource manager before. The consequence of the refusal is
that the job manager rejects the registrations of these task managers. 
    Such a scenario can happen if a `JobManager` loses leadership in an HA setting after it
    some TMs. The old behaviour was that the resource manager clears the list of registered
workers and only accepts new registrations of task manager which have been started in a fresh
container. However, in case that the previously registered TMs didn't die, they will try to
reconnect to the new leader. The new leader will then ask the resource manager whether the
TMs represent valid resources. Since the resource manager forgot about the already started
containers, it rejects the TMs.
    This PR changes the behaviour of the resource manager such that it can no longer reject
TMs. Instead of being asked it will simply be informed about the registered TMs by the JM.
If the TM happens to be running in a container which was started by the RM, then it will monitor
this container. In case that this container dies, the RM will notify the JM about the death
of the TM.
    In that sense, the RM has no longer the authority to interfere with the JM-TM interactions
and instead it is simply used as an additional monitoring service to detect dead TMs as a
result of a failed container.
    Furthermore, the PR adds a de-duplication method to filter out concurrent registration
runs on the task manager. Before, it happened that a `RefusedRegistration` triggers a new
registration run without cancelling the old registration run. This could lead to a massive
amount of registration messages if the TaskManager's registration was refused multiple times.
    The mechanism to de-duplicate `TriggerTaskManagerRegistration` works by assigning a registration
run id which is changed whenever a new registration run is started. `TriggerTaskManagerRegistration`
messages which have an outdated registration run id are then filtered out.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink FLINK-4152_YarnResourceManager

Alternatively you can review and apply these changes as the patch at:


To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2257
commit 2a29fa4df98fd63a7f27e1607152cd6e54d25ad1
Author: Till Rohrmann <trohrmann@apache.org>
Date:   2016-07-15T08:51:59Z

    Add YarnFlinkResourceManager test to reaccept task manager registrations from a re-elected
job manager

commit 6462d4750d79512fc93bbc60ca754c99142d1794
Author: Till Rohrmann <trohrmann@apache.org>
Date:   2016-07-15T09:50:35Z

    Remove unnecessary sync logic between JobManager and ResourceManager

commit 55ce0c01783d9c4927a9f4677309c805e2b624e7
Author: Till Rohrmann <trohrmann@apache.org>
Date:   2016-07-15T10:12:12Z

    Avoid duplicate reigstration attempts in case of a refused registration

commit aeb6ae5435dd88c640dc314b9ec31357815d080c
Author: Till Rohrmann <trohrmann@apache.org>
Date:   2016-07-15T13:04:54Z

    Add test case to check that not an excessive amount of RegisterTaskManager messages are


> TaskManager registration exponential backoff doesn't work
> ---------------------------------------------------------
>                 Key: FLINK-4152
>                 URL: https://issues.apache.org/jira/browse/FLINK-4152
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination, TaskManager, YARN Client
>            Reporter: Robert Metzger
>            Assignee: Till Rohrmann
>         Attachments: logs.tgz
> While testing Flink 1.1 I've found that the TaskManagers are logging many messages when
registering at the JobManager.
> This is the log file: https://gist.github.com/rmetzger/0cebe0419cdef4507b1e8a42e33ef294
> Its logging more than 3000 messages in less than a minute. I don't think that this is
the expected behavior.

This message was sent by Atlassian JIRA

View raw message