flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-1352) Buggy registration from TaskManager to JobManager
Date Mon, 26 Jan 2015 15:22:34 GMT

    [ https://issues.apache.org/jira/browse/FLINK-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291940#comment-14291940 ]

ASF GitHub Bot commented on FLINK-1352:
---------------------------------------

Github user tillrohrmann commented on the pull request:

    https://github.com/apache/flink/pull/328#issuecomment-71476676
  
    I updated the PR with the exponential backoff registration strategy. Along the way, I fixed
the flaky RecoveryIT case.
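
    For readers unfamiliar with the term, an exponential backoff strategy doubles the wait
between failed registration attempts up to some upper bound. The following is only a minimal
sketch of that idea, not the code from the pull request; the class name, method names, and the
delay values are hypothetical and chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not taken from the Flink code base.
final class RegistrationBackoffSketch {

    // Returns the delay to use after a failed attempt: double the current
    // delay, but never exceed maxMillis.
    static long nextDelayMillis(long currentMillis, long maxMillis) {
        if (currentMillis <= 0 || maxMillis <= 0) {
            throw new IllegalArgumentException("delays must be positive");
        }
        // Comparing against maxMillis / 2 avoids overflow when doubling.
        return currentMillis > maxMillis / 2 ? maxMillis : currentMillis * 2;
    }

    // Lists the delays that would precede the first `attempts` retries.
    static List<Long> schedule(long initialMillis, long maxMillis, int attempts) {
        List<Long> delays = new ArrayList<>();
        long delay = Math.min(initialMillis, maxMillis);
        for (int i = 0; i < attempts; i++) {
            delays.add(delay);
            delay = nextDelayMillis(delay, maxMillis);
        }
        return delays;
    }

    public static void main(String[] args) {
        // Assumed values: start at 500 ms, cap at 4 s, show six retries.
        System.out.println(schedule(500L, 4000L, 6));
    }
}
```

    With the assumed values above, the delays grow as 500, 1000, 2000, 4000 ms and then stay
at the 4000 ms cap, so the TaskManager could keep retrying indefinitely without flooding the
JobManager.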


> Buggy registration from TaskManager to JobManager
> -------------------------------------------------
>
>                 Key: FLINK-1352
>                 URL: https://issues.apache.org/jira/browse/FLINK-1352
>             Project: Flink
>          Issue Type: Bug
>          Components: JobManager, TaskManager
>    Affects Versions: 0.9
>            Reporter: Stephan Ewen
>            Assignee: Till Rohrmann
>             Fix For: 0.9
>
>
> The JobManager's InstanceManager may refuse the registration attempt from a TaskManager,
> because it has this TaskManager already connected, or, in the future, because the TaskManager
> has been blacklisted as unreliable.
> Upon refused registration, the instance ID is null, to signal the refused registration.
> The TaskManager reacts incorrectly to such messages, assuming successful registration.
> Possible solution: the JobManager sends back a dedicated "RegistrationRefused" message if
> the instance manager returns null as the registration result. If the TaskManager receives
> that before being registered, it knows that the registration response was lost (which should
> not happen on TCP and would indicate a corrupt connection).
> Follow-up question: Does it make sense to have the TaskManager try indefinitely to
> connect to the JobManager, with an increasing interval (from seconds to minutes)?
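
The proposed solution in the issue description can be sketched roughly as follows. This is a
simplified illustration under stated assumptions, not Flink's actual message protocol: only the
name "RegistrationRefused" comes from the issue text, while the other class, field, and method
names are hypothetical.

```java
// Hypothetical types for illustration; only the name "RegistrationRefused"
// appears in the issue description.
final class RegistrationSketch {

    interface Response {}

    static final class Acknowledged implements Response {
        final String instanceId;
        Acknowledged(String instanceId) { this.instanceId = instanceId; }
    }

    static final class RegistrationRefused implements Response {
        final String reason;
        RegistrationRefused(String reason) { this.reason = reason; }
    }

    // JobManager side: turn a null instance ID into an explicit refusal
    // instead of sending the null back to the TaskManager.
    static Response toResponse(String instanceIdOrNull) {
        return instanceIdOrNull == null
                ? new RegistrationRefused("instance manager refused the registration")
                : new Acknowledged(instanceIdOrNull);
    }

    // TaskManager side: only an acknowledgement marks the TaskManager as registered.
    static final class TaskManagerState {
        private String instanceId; // null until registration is acknowledged

        boolean isRegistered() { return instanceId != null; }

        void handle(Response response) {
            if (response instanceof Acknowledged) {
                instanceId = ((Acknowledged) response).instanceId;
            }
            // A RegistrationRefused leaves the state unregistered, so the
            // caller can decide whether to retry or to give up.
        }
    }
}
```

Making the refusal a distinct message type means the TaskManager no longer has to interpret a
null instance ID, which is the case it currently mistakes for a successful registration.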



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
