spark-issues mailing list archives

From "Stephan Kepser (JIRA)" <j...@apache.org>
Subject [jira] [Issue Comment Deleted] (SPARK-18159) Stand-alone cluster, supervised app: restart of worker hosting the driver causes app to run twice
Date Fri, 28 Oct 2016 10:22:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-18159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stephan Kepser updated SPARK-18159:
-----------------------------------
    Comment: was deleted

(was: I saw the old executors keep running for several hours (more than 5 h).
We run a stand-alone Spark cluster without YARN or Mesos, so using YARN to kill the
old executors is unfortunately not an option. Killing the old executors via the REST API
also failed: they are immediately restarted.)

> Stand-alone cluster, supervised app: restart of worker hosting the driver causes app to run twice
> -------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-18159
>                 URL: https://issues.apache.org/jira/browse/SPARK-18159
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.2
>            Reporter: Stephan Kepser
>            Priority: Critical
>
> We use Spark in stand-alone cluster mode with HA, using three master nodes. All apps are submitted using
> > spark-submit --deploy-mode cluster --supervise --master ...
> We have many apps running. 
> The cluster deploy mode is needed to keep the drivers of all apps from being placed on the active master.
> If a worker that hosts a driver goes down, the following happens:
> * the driver is started on another worker node
> * the new driver does not connect to the still running app
> * the new driver starts a new instance of the running app
> * there are now two instances of the app running, 
>   * one with an attached new driver,
>   * one without a driver.
> * the old instance of the app cannot effectively be stopped, i.e., it can be killed via the UI, but is immediately restarted.
> Repeating this process leaves more and more instances of the app running.
> Both options, --deploy-mode cluster and --supervise, are required to trigger the effect.
> The only remedy we know of is to reboot all Linux nodes the cluster runs on.
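
For readers who want to reproduce the setup described in the report, the following is a minimal sketch of how such a submission command might be assembled. It only builds and prints the command and does not run it. The master host names, main class, and jar path are hypothetical placeholders that do not appear in the report; the comma-separated master URL is the form used for stand-alone HA, and the flag is spelled `--supervise` in spark-submit.

```python
import shlex


def build_submit_command(masters, main_class, app_jar, supervise=True):
    """Return a spark-submit argument list for a stand-alone, cluster-mode app."""
    # With several HA masters, the stand-alone master URL lists all of them.
    master_url = "spark://" + ",".join(masters)
    cmd = ["spark-submit", "--master", master_url, "--deploy-mode", "cluster"]
    if supervise:
        # Ask the cluster to restart the driver if it exits abnormally.
        cmd.append("--supervise")
    cmd += ["--class", main_class, app_jar]
    return cmd


# Hypothetical values, chosen only for illustration.
cmd = build_submit_command(
    ["master1:7077", "master2:7077", "master3:7077"],
    "com.example.MyApp",
    "/path/to/my-app.jar",
)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

According to the report, both `--deploy-mode cluster` and `--supervise` must be present for the duplicate-app behavior to appear; passing `supervise=False` to the helper yields the variant without supervision for comparison.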



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

