spark-issues mailing list archives

From "Matei Zaharia (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (SPARK-1685) retryTimer not canceled on actor restart in Worker and AppClient
Date Tue, 06 May 2014 19:58:25 GMT

     [ https://issues.apache.org/jira/browse/SPARK-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia resolved SPARK-1685.
----------------------------------

       Resolution: Fixed
    Fix Version/s: 0.9.2
                   1.0.0

> retryTimer not canceled on actor restart in Worker and AppClient
> ----------------------------------------------------------------
>
>                 Key: SPARK-1685
>                 URL: https://issues.apache.org/jira/browse/SPARK-1685
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 0.9.0, 1.0.0, 0.9.1
>            Reporter: Mark Hamstra
>            Assignee: Mark Hamstra
>             Fix For: 1.0.0, 0.9.2
>
>
> Both deploy.worker.Worker and deploy.client.AppClient try to registerWithMaster when
> those Actors start. The attempt at registration is accomplished by starting a retryTimer
> via the Akka scheduler that will use the registered timeout interval and retry number to make
> repeated attempts to register with all known Masters before giving up and either marking as
> dead or calling System.exit.
>
> The receive methods of these actors can, however, throw exceptions, which will lead to
> the actor restarting, registerWithMaster being called again on restart, and another retryTimer
> being scheduled without canceling the already running retryTimer. Assuming that all of the
> rest of the restart logic is correct for these actors (which I don't believe is actually a
> given), having multiple retryTimers running presents at least a condition in which the restarted
> actor may not be able to make the full number of retry attempts before an earlier retryTimer
> takes the "give up" action.
>
> Canceling the retryTimer in the actor's postStop hook should suffice.
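
The failure mode described above can be illustrated with a small deterministic model. This is only a sketch in Python, not Spark or Akka code: ManualScheduler, Cancellable, RegisteringActor, MAX_RETRIES, and run_restart_scenario are hypothetical names invented for the illustration. It assumes Akka's default restart behavior, in which postStop runs on the old actor instance before preStart runs on the new one, and it replaces real timers with a scheduler that is ticked by hand.

```python
MAX_RETRIES = 3  # hypothetical retry limit, chosen only for this illustration


class Cancellable:
    """Handle for a repeating task (stand-in for a scheduler handle)."""

    def __init__(self, task):
        self.task = task
        self.cancelled = False

    def cancel(self):
        self.cancelled = True


class ManualScheduler:
    """Deterministic stand-in for a scheduler: tick() fires every live timer once."""

    def __init__(self):
        self.timers = []

    def schedule(self, task):
        handle = Cancellable(task)
        self.timers.append(handle)
        return handle

    def tick(self):
        for handle in list(self.timers):
            if not handle.cancelled:
                handle.task()


class RegisteringActor:
    """One incarnation of an actor that retries registration on a timer."""

    def __init__(self, name, scheduler, events, cancel_on_stop):
        self.name = name
        self.scheduler = scheduler
        self.events = events
        self.cancel_on_stop = cancel_on_stop
        self.retries = 0
        self.retry_timer = None

    def pre_start(self):
        # Models registerWithMaster: start the retry timer.
        self.retry_timer = self.scheduler.schedule(self._retry)

    def _retry(self):
        self.retries += 1
        if self.retries >= MAX_RETRIES:
            # Models the "give up" action (mark dead or exit).
            self.events.append(f"{self.name}: give up")
            self.retry_timer.cancel()
        else:
            self.events.append(f"{self.name}: retry {self.retries}")

    def post_stop(self):
        # The proposed remedy: cancel the timer when this incarnation stops.
        if self.cancel_on_stop and self.retry_timer is not None:
            self.retry_timer.cancel()


def run_restart_scenario(cancel_on_stop):
    scheduler = ManualScheduler()
    events = []
    first = RegisteringActor("first", scheduler, events, cancel_on_stop)
    first.pre_start()
    scheduler.tick()
    scheduler.tick()
    # An exception in receive triggers a restart: old instance stops, new one starts.
    first.post_stop()
    second = RegisteringActor("second", scheduler, events, cancel_on_stop)
    second.pre_start()
    scheduler.tick()
    return events
```

In this model, run_restart_scenario(False) records a "give up" from the first incarnation's timer after the restarted actor has made only one attempt, while run_restart_scenario(True) lets the restarted actor continue with its own retry count.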



--
This message was sent by Atlassian JIRA
(v6.2#6252)
