spark-issues mailing list archives

From "niranda perera (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-14736) Deadlock in registering applications while the Master is in the RECOVERING mode
Date Tue, 19 Apr 2016 20:56:25 GMT

     [ https://issues.apache.org/jira/browse/SPARK-14736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

niranda perera updated SPARK-14736:
-----------------------------------
    Affects Version/s: 1.5.0
                       1.6.0

> Deadlock in registering applications while the Master is in the RECOVERING mode
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-14736
>                 URL: https://issues.apache.org/jira/browse/SPARK-14736
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.4.1, 1.5.0, 1.6.0
>         Environment: unix, Spark cluster with a custom StandaloneRecoveryModeFactory and a custom PersistenceEngine
>            Reporter: niranda perera
>            Priority: Critical
>
> I have encountered the following issue in standalone recovery mode.
> Suppose an application A is running in the cluster. Due to some failure, the entire cluster goes down, together with application A.
> Later the cluster comes back online and the master enters the RECOVERING mode, because the Persistence Engine reports apps, workers and drivers that were previously in the cluster. While recovery is in progress, the application comes back online, but now with a different ID, say B.
> However, the master's application registration logic does NOT add application B to 'waitingApps'; it only logs the message "Attempted to re-register application at same address". [1]
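>   private def registerApplication(app: ApplicationInfo): Unit = {
>     val appAddress = app.driver.address
>     // A registration from an address that is already tracked is ignored.
>     if (addressToApp.contains(appAddress)) {
>       logInfo("Attempted to re-register application at same address: " + appAddress)
>       return
>     }
>     // ... (rest of the method not quoted in this report)
>   }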
> The problem is that the master is trying to recover application A, which no longer exists. After the recovery process, app A is therefore dropped. However, app A's successor, app B, was also left out of the 'waitingApps' list, because it registered from the same address that app A used previously.
> This creates a deadlock in the cluster: neither app A nor app B is available in the cluster.
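
The sequence described above can be reproduced with a small stand-alone model. This is only a sketch: AppInfo, MiniMaster and DeadlockDemo are hypothetical stand-ins rather than Spark's real classes, the driver address is made up, and it assumes that apps restored from the persistence engine go through the same address check as new registrations.

```scala
import scala.collection.mutable

// Hypothetical stand-in for Spark's ApplicationInfo: just an ID and a driver address.
final case class AppInfo(id: String, driverAddress: String)

// Hypothetical, heavily simplified model of the master's bookkeeping.
class MiniMaster {
  val addressToApp = mutable.HashMap.empty[String, AppInfo]
  val waitingApps = mutable.ArrayBuffer.empty[AppInfo]

  // Mirrors the quoted check: a registration from a known address is ignored.
  def registerApplication(app: AppInfo): Unit = {
    if (!addressToApp.contains(app.driverAddress)) {
      addressToApp(app.driverAddress) = app
      waitingApps += app
    }
  }

  // Models the end of recovery, where an app that never responded is dropped.
  def removeApplication(app: AppInfo): Unit = {
    addressToApp.remove(app.driverAddress)
    waitingApps -= app
  }
}

object DeadlockDemo {
  // Returns the IDs left in waitingApps after the scenario from the report.
  def run(): List[String] = {
    val master = new MiniMaster
    val a = AppInfo("A", "driver-host:7077") // assumed address, for illustration
    master.registerApplication(a)            // A is restored during recovery
    master.registerApplication(AppInfo("B", "driver-host:7077")) // B is ignored
    master.removeApplication(a)              // A never responds and is dropped
    master.waitingApps.map(_.id).toList
  }
}
```

In this model, DeadlockDemo.run() leaves waitingApps empty, which matches the state described in the report.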

> When the master is in the RECOVERING mode, shouldn't it first add all registering apps to a list and then, after the recovery is completed (once the unsuccessful recoveries are removed), deploy the apps that are new?
> This would resolve the deadlock, IMO.
> [1] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L834
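
One possible reading of the proposal above is sketched below. It is not a patch against Master.scala: BufferingMaster, PendingApp, restore, register and completeRecovery are hypothetical names chosen for illustration, and the driver address is made up. The idea is that registrations arriving during recovery are parked in a list and replayed once the unresponsive apps have been removed.

```scala
import scala.collection.mutable

// Hypothetical stand-in for an application: an ID and a driver address.
final case class PendingApp(id: String, driverAddress: String)

// Hypothetical model of the proposed behaviour; not Spark's actual Master.
class BufferingMaster {
  private var recovering = true
  private val addressToApp = mutable.HashMap.empty[String, PendingApp]
  private val deferred = mutable.ArrayBuffer.empty[PendingApp]
  val waitingApps = mutable.ArrayBuffer.empty[PendingApp]

  // Apps read from the persistence engine are tracked by address, as before.
  def restore(app: PendingApp): Unit = {
    addressToApp(app.driverAddress) = app
  }

  // While recovering, park new registrations instead of rejecting them.
  def register(app: PendingApp): Unit = {
    if (recovering) {
      deferred += app
    } else if (!addressToApp.contains(app.driverAddress)) {
      addressToApp(app.driverAddress) = app
      waitingApps += app
    }
  }

  // End of recovery: drop apps that never responded, then replay parked ones.
  def completeRecovery(unresponsive: Seq[PendingApp]): Unit = {
    unresponsive.foreach(a => addressToApp.remove(a.driverAddress))
    recovering = false
    val pending = deferred.toList
    deferred.clear()
    pending.foreach(register)
  }
}

object RecoveryDemo {
  // Returns the IDs in waitingApps after recovery completes.
  def run(): List[String] = {
    val master = new BufferingMaster
    val a = PendingApp("A", "driver-host:7077") // assumed address, for illustration
    master.restore(a)                           // A comes from the persistence engine
    master.register(PendingApp("B", "driver-host:7077")) // B arrives mid-recovery
    master.completeRecovery(Seq(a))             // A never responded
    master.waitingApps.map(_.id).toList
  }
}
```

In this model app B ends up in waitingApps once recovery completes, because its registration is only checked against addressToApp after app A's stale entry has been removed.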



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

