spark-issues mailing list archives

From "Petar Petrov (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-23182) Allow enabling of TCP keep alive for master RPC connections
Date Mon, 05 Feb 2018 20:15:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-23182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16352870#comment-16352870 ]

Petar Petrov edited comment on SPARK-23182 at 2/5/18 8:14 PM:
--------------------------------------------------------------

We run a cluster of ~1000 cores in GCE, using preemptible VMs for executors / workers and a
standard (non-preemptible) master VM. The cluster runs 24/7 and processes about 20000 jobs / day.

Over time many workers join the cluster and later drop out of it. GCE evicts preemptible VMs
without a graceful shutdown.

GCE does support setting a shutdown script on preemptible VMs, but it's not always invoked
(from https://cloud.google.com/compute/docs/shutdownscript):

{noformat}
Compute Engine only executes shutdown scripts on a best-effort basis and does not guarantee
that the shutdown script will be run in all cases.
{noformat}

When a worker joins the cluster and its VM is then stopped without the executor shutting down
gracefully, the master keeps the connection open (although inactive) indefinitely. After some
time the master fails with "Too many open files" and cannot accept connections anymore. Hence
the need to enable TCP keep alive. With it enabled, when a worker is stopped the master's OS
probes the other side and closes the connection if it's not responding.
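
To illustrate the mechanism, here is a minimal sketch of enabling keep alive on the accepting side of a TCP connection, using only the plain java.net API. It is not Spark's RPC code and not the proposed change; the class name KeepAliveSketch and the helper acceptWithKeepAlive are hypothetical, and the loopback server only stands in for the master.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical illustration only: not part of Spark.
class KeepAliveSketch {

    // Accepts one pending connection and turns on SO_KEEPALIVE for it,
    // so the OS periodically probes an idle peer and eventually drops the
    // connection if the peer never answers.
    static Socket acceptWithKeepAlive(ServerSocket server) throws IOException {
        Socket peer = server.accept();
        peer.setKeepAlive(true);
        return peer;
    }

    public static void main(String[] args) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Port 0 asks the OS for any free port; the "worker" connects first
        // and waits in the accept backlog until the "master" accepts it.
        try (ServerSocket server = new ServerSocket(0, 50, loopback);
             Socket worker = new Socket(loopback, server.getLocalPort());
             Socket masterSide = acceptWithKeepAlive(server)) {
            System.out.println("SO_KEEPALIVE on master side: " + masterSide.getKeepAlive());
        }
    }
}
```

Note that the socket option only switches probing on. How soon a dead peer is detected depends on OS settings; on Linux, for example, the idle time before the first probe is net.ipv4.tcp_keepalive_time, which defaults to 7200 seconds, so those values may also need tuning.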

 



> Allow enabling of TCP keep alive for master RPC connections
> -----------------------------------------------------------
>
>                 Key: SPARK-23182
>                 URL: https://issues.apache.org/jira/browse/SPARK-23182
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Petar Petrov
>            Priority: Major
>
> We rely heavily on preemptible worker machines in GCP/GCE. These machines disappear without
> closing their TCP connections to the master, so the number of established connections keeps
> growing until new workers cannot connect because of "Too many open files" on the master.
> To solve the problem we need to enable TCP keep alive for the RPC connections to the
> master, but it's not possible to do so via configuration.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

