flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-1352) Buggy registration from TaskManager to JobManager
Date Fri, 23 Jan 2015 01:38:36 GMT

    [ https://issues.apache.org/jira/browse/FLINK-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14288569#comment-14288569 ]

ASF GitHub Bot commented on FLINK-1352:

Github user StephanEwen commented on the pull request:

    I am not sure that an infinite number of tries is actually bad. It depends on the
situation, I guess:
      - On YARN, a limit may make sense, because the node will then go back into the pool of available
resources.
      - On standalone, the node is there for Flink anyway, so the TaskManager might as well
keep trying to offer itself for work. Think of a network partitioning event: after the partitions
have rejoined, the cluster should work as a whole again.
    How about the following: We add a config parameter for how long nodes should attempt to register.
YARN could set a timeout (say 2-5 minutes), while by default the timeout is infinite.
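    A minimal sketch of such a bounded registration loop, to make the proposal concrete. This
is not code from the pull request: `register_with_timeout`, `max_duration_s`, `pause_s` and the
injected `try_register`, `now` and `sleep` callables are all hypothetical names, and a fixed
pause is used for brevity.

```python
import math
from typing import Callable, Optional


def register_with_timeout(
    try_register: Callable[[], bool],    # hypothetical: one registration attempt
    now: Callable[[], float],            # clock in seconds, injected for testing
    sleep: Callable[[float], None],      # pause function, injected for testing
    max_duration_s: Optional[float] = None,  # None = keep trying forever (proposed default)
    pause_s: float = 0.05,
) -> bool:
    """Try to register until it succeeds or the configured duration has elapsed."""
    deadline = math.inf if max_duration_s is None else now() + max_duration_s
    while now() < deadline:
        if try_register():
            return True
        sleep(pause_s)
    # Only reachable when a finite timeout was configured, e.g. on YARN.
    return False
```

With `max_duration_s=None` the loop never gives up, which matches the standalone case above;
a YARN deployment would pass something like `max_duration_s=300`.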
    Concerning the attempt pause: Having attempts with exponential backoff (and a cap) is
the common thing (and I think it was the default before). Start with a 50ms pause, double
it on each attempt, and cap it at 1 or 2 minutes or so. If you miss early attempts, the pause
will not be long. If you missed all attempts within the first second, you are guaranteed
not to wait more than twice as long as you have already waited anyway.
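    The schedule described above can be sketched as follows. The function name `backoff_pauses`
and its parameters are hypothetical; the 50ms start comes from the comment, and the 2 minute cap
is one of the two values it suggests.

```python
from itertools import islice
from typing import Iterator


def backoff_pauses(initial_ms: int = 50, cap_ms: int = 120_000) -> Iterator[int]:
    """Yield the pause before each retry: start at initial_ms, double, never exceed cap_ms."""
    pause = initial_ms
    while True:
        yield pause
        pause = min(pause * 2, cap_ms)


print(list(islice(backoff_pauses(), 6)))  # [50, 100, 200, 400, 800, 1600]
```

Before the cap is reached, each pause equals the total time waited so far plus the initial 50ms,
so the next wait is never more than twice what has already elapsed. With these defaults the cap
takes effect from the 13th pause onward.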
    For the sake of transparency and making sure that the states are actually in sync: How
about we have three response messages for the registration attempt:
      1. Refused (for whatever reason, the message should have a string that the TM can log)
      2. Accepted (with the assigned ID)
      3. Already registered (with the assigned ID) - The current logic handles this correctly
as well, but this will allow us to log better at the TaskManager and debug problems there
much more easily. Since this is a mechanism which may have weird corner-case behavior, it would
be good to know as much as possible about what was happening.
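    As an illustration of the three responses, here is a minimal sketch of how the TaskManager
side could tell them apart and log each case. Only the name `RegistrationRefused` appears in the
issue; `RegistrationAccepted`, `AlreadyRegistered`, `handle_response` and the log wording are
assumptions made for this example, not the actual Flink messages.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class RegistrationRefused:
    reason: str  # human-readable string the TaskManager can log


@dataclass(frozen=True)
class RegistrationAccepted:
    instance_id: str


@dataclass(frozen=True)
class AlreadyRegistered:
    instance_id: str


RegistrationResponse = Union[RegistrationRefused, RegistrationAccepted, AlreadyRegistered]


def handle_response(response: RegistrationResponse) -> Tuple[Optional[str], str]:
    """Return (instance ID or None if refused, log line) for a registration response."""
    if isinstance(response, RegistrationAccepted):
        return response.instance_id, f"Registered at JobManager as {response.instance_id}"
    if isinstance(response, AlreadyRegistered):
        # Same outcome as Accepted, but logged differently to help debugging.
        return response.instance_id, f"Already registered at JobManager as {response.instance_id}"
    if isinstance(response, RegistrationRefused):
        return None, f"Registration refused: {response.reason}"
    raise TypeError(f"Unexpected registration response: {response!r}")
```

Keeping "accepted" and "already registered" as separate types costs little and makes a lost or
duplicated registration response visible in the TaskManager log.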

> Buggy registration from TaskManager to JobManager
> -------------------------------------------------
>                 Key: FLINK-1352
>                 URL: https://issues.apache.org/jira/browse/FLINK-1352
>             Project: Flink
>          Issue Type: Bug
>          Components: JobManager, TaskManager
>    Affects Versions: 0.9
>            Reporter: Stephan Ewen
>            Assignee: Till Rohrmann
>             Fix For: 0.9
> The JobManager's InstanceManager may refuse the registration attempt from a TaskManager,
because it has this TaskManager already connected, or, in the future, because the TaskManager
has been blacklisted as unreliable.
> Upon refused registration, the instance ID is null, to signal the refusal.
The TaskManager reacts incorrectly to such messages, assuming successful registration.
> Possible solution: JobManager sends back a dedicated "RegistrationRefused" message, if
the instance manager returns null as the registration result. If the TaskManager receives
that before being registered, it knows that the registration response was lost (which should
not happen on TCP and would indicate a corrupt connection).
> Follow-up question: Does it make sense to have the TaskManager try indefinitely to
connect to the JobManager, with an increasing interval (from seconds to minutes)?

This message was sent by Atlassian JIRA
