spark-issues mailing list archives

From "zhangzhiyan (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-17468) Cluster workers crash when the master's network is bad for more than one WORKER_TIMEOUT_MS
Date Fri, 09 Sep 2016 09:00:34 GMT

     [ https://issues.apache.org/jira/browse/SPARK-17468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhangzhiyan updated SPARK-17468:
--------------------------------
    Description: 
I work at a commercial company in China. My production Spark standalone cluster crashed during the 9.9 sales event; the master log is below:

16/09/09 09:49:57 WARN Master: Removing worker-20160814124907-10.205.130.37-16590 because we got no heartbeat in 60 seconds
16/09/09 09:49:57 WARN Master: Removing worker-20160814113016-10.205.130.13-57487 because we got no heartbeat in 60 seconds
16/09/09 09:49:57 WARN Master: Removing worker-20160814134926-10.205.130.39-11430 because we got no heartbeat in 60 seconds
16/09/09 09:49:57 WARN Master: Removing worker-20160814131257-10.205.130.38-32160 because we got no heartbeat in 60 seconds
16/09/09 09:49:57 WARN Master: Removing worker-20160814161444-10.205.136.19-14196 because we got no heartbeat in 60 seconds
16/09/09 09:49:57 WARN Master: Removing worker-20160814141654-10.205.130.42-49707 because we got no heartbeat in 60 seconds
16/09/09 09:49:57 WARN Master: Removing worker-20160814115125-10.205.130.14-38381 because we got no heartbeat in 60 seconds
16/09/09 09:49:57 WARN Master: Removing worker-20160814152146-10.205.136.10-24730 because we got no heartbeat in 60 seconds
16/09/09 09:49:57 WARN Master: Removing worker-20160814122817-10.205.130.36-54348 because we got no heartbeat in 60 seconds
16/09/09 09:49:57 WARN Master: Removing worker-20160814170452-10.205.136.34-9921 because we got no heartbeat in 60 seconds
16/09/09 09:49:58 WARN Master: Removing worker-20160814154744-10.205.136.12-12399 because we got no heartbeat in 60 seconds
16/09/09 09:49:58 WARN Master: Removing worker-20160814150355-10.205.130.44-5792 because we got no heartbeat in 60 seconds
16/09/09 09:49:58 WARN Master: Removing worker-20160814143901-10.205.130.43-2223 because we got no heartbeat in 60 seconds
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814124907-10.205.130.37-16590. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814170452-10.205.136.34-9921. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814141654-10.205.130.42-49707. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814115125-10.205.130.14-38381. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814134926-10.205.130.39-11430. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814131257-10.205.130.38-32160. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814150355-10.205.130.44-5792. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814154744-10.205.136.12-12399. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814161444-10.205.136.19-14196. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814113016-10.205.130.13-57487. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814152146-10.205.136.10-24730. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814143901-10.205.130.43-2223. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814122817-10.205.130.36-54348. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814124907-10.205.130.37-16590. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814170452-10.205.136.34-9921. Asking it to re-register.


I believe the code quoted further below may be at fault. When the master's network is bad for longer than WORKER_TIMEOUT_MS, the master removes the worker and executor information from its memory. When the workers quickly re-establish their connection and re-register, their old state has already been erased on the master, even though they are still running the old executors. The master then allocates more resources than the workers can afford, and that crashes my workers.
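
To make the failure mode concrete before quoting the Spark source, here is a minimal, self-contained Scala sketch of the race as I understand it. WorkerRecord and OverCommitDemo are simplified, hypothetical stand-ins, not Spark's real classes; this only illustrates the bookkeeping loss, not the actual scheduler code.

  // Simplified model of the master's per-worker bookkeeping (hypothetical names).
  case class WorkerRecord(id: String, totalCores: Int, var coresUsed: Int) {
    def coresFree: Int = totalCores - coresUsed
  }

  object OverCommitDemo {
    def main(args: Array[String]): Unit = {
      val workers = scala.collection.mutable.Map[String, WorkerRecord]()

      // 1. Worker registers; the master has handed its executors 12 of 16 cores.
      workers("worker-1") = WorkerRecord("worker-1", totalCores = 16, coresUsed = 12)

      // 2. Heartbeats are lost for more than WORKER_TIMEOUT_MS, so
      //    timeOutDeadWorkers() erases the record, including coresUsed.
      workers -= "worker-1"

      // 3. The worker reconnects and re-registers. The master no longer knows
      //    about the 12 cores still held by the old, still-running executors,
      //    so the fresh record claims every core is free.
      workers("worker-1") = WorkerRecord("worker-1", totalCores = 16, coresUsed = 0)

      // 4. The master may now schedule up to 16 more cores of executors onto a
      //    machine already running 12 cores' worth of work.
      println(s"cores the master thinks are free: ${workers("worker-1").coresFree}") // prints 16
    }
  }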

Code location:
org.apache.spark.deploy.master.Master, line 1023

  /** Check for, and remove, any timed-out workers */
  private def timeOutDeadWorkers() {
    // Copy the workers into an array so we don't modify the hashset while iterating through it
    val currentTime = System.currentTimeMillis()
    val toRemove = workers.filter(_.lastHeartbeat < currentTime - WORKER_TIMEOUT_MS).toArray
    for (worker <- toRemove) {
      if (worker.state != WorkerState.DEAD) {
        logWarning("Removing %s because we got no heartbeat in %d seconds".format(
          worker.id, WORKER_TIMEOUT_MS / 1000))
        removeWorker(worker)
      } else {
        if (worker.lastHeartbeat < currentTime - ((REAPER_ITERATIONS + 1) * WORKER_TIMEOUT_MS)) {
          workers -= worker // we've seen this DEAD worker in the UI, etc. for long enough; cull it
        }
      }
    }
  }
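
A possible stopgap (it only widens the window; it does not fix the lost bookkeeping): WORKER_TIMEOUT_MS is derived from the spark.worker.timeout setting (in seconds, default 60), so the timeout can be raised on the master, for example in spark-env.sh. The value 300 below is only an example.

  # spark-env.sh on the master node: raise the worker heartbeat timeout from
  # the default 60 seconds to 300, so short network outages no longer trigger
  # the removal path above.
  SPARK_MASTER_OPTS="-Dspark.worker.timeout=300"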




        Summary: Cluster workers crash when the master's network is bad for more than one WORKER_TIMEOUT_MS
 (was: Cluster worker memory exceeded when master network bad more than one minute!)

> Cluster workers crash when the master's network is bad for more than one WORKER_TIMEOUT_MS
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-17468
>                 URL: https://issues.apache.org/jira/browse/SPARK-17468
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.1
>         Environment: CentOS 6.5, Spark standalone, 15 machines, 15 workers and 2 masters; worker, master, and driver run on the same machine
>            Reporter: zhangzhiyan
>            Priority: Critical
>              Labels: Spark, WORKER_TIMEOUT_MS, crush, standalone
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

