spark-issues mailing list archives

From "Alexander Shorin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-18523) OOM killer may leave SparkContext in broken state causing Connection Refused errors
Date Thu, 31 Aug 2017 09:15:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-18523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148698#comment-16148698 ]

Alexander Shorin commented on SPARK-18523:
------------------------------------------

[~kadeng]
I don't have 2.2.0 in production yet (shame on me!), but I will check this. Thanks for the
report!

> OOM killer may leave SparkContext in broken state causing Connection Refused errors
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-18523
>                 URL: https://issues.apache.org/jira/browse/SPARK-18523
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.6.1, 2.0.0
>            Reporter: Alexander Shorin
>            Assignee: Alexander Shorin
>             Fix For: 2.1.0
>
>
> When you run a memory-heavy Spark job, the Spark driver may consume more memory than the host can provide.
> In this case the OOM killer steps in and kills the spark-submit process.
> The pyspark.SparkContext cannot handle this situation and becomes completely broken.
> You cannot stop it, because on stop it tries to call the stop method of the bound Java context (jsc) and fails with Py4JError: that process no longer exists, and neither does the connection to it.
> You cannot start a new SparkContext, because the broken one is still registered as the active one and pyspark still treats SparkContext as a sort of singleton.
> The only thing you can do is shut down your IPython Notebook and start it over, or dive into the SparkContext internal attributes and reset them manually to their initial None state.
> The OOM killer case is just one of many: any spark-submit crash in the middle of a job leaves the SparkContext in a broken state.
> Example error log on {{sc.stop()}} in the broken state:
> {code}
> ERROR:root:Exception while sending command.
> Traceback (most recent call last):
>   File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 883, in send_command
>     response = connection.send_command(command)
>   File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 1040, in send_command
>     "Error while receiving", e, proto.ERROR_ON_RECEIVE)
> Py4JNetworkError: Error while receiving
> ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:59911)
> Traceback (most recent call last):
>   File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 963, in start
>     self.socket.connect((self.address, self.port))
>   File "/usr/local/lib/python2.7/socket.py", line 224, in meth
>     return getattr(self._sock,name)(*args)
> error: [Errno 61] Connection refused
> ---------------------------------------------------------------------------
> Py4JError                                 Traceback (most recent call last)
> <ipython-input-2-f154e069615b> in <module>()
> ----> 1 sc.stop()
> /usr/local/share/spark/python/pyspark/context.py in stop(self)
>     360         """
>     361         if getattr(self, "_jsc", None):
> --> 362             self._jsc.stop()
>     363             self._jsc = None
>     364         if getattr(self, "_accumulatorServer", None):
> /usr/local/lib/python2.7/site-packages/py4j/java_gateway.pyc in __call__(self, *args)
>    1131         answer = self.gateway_client.send_command(command)
>    1132         return_value = get_return_value(
> -> 1133             answer, self.gateway_client, self.target_id, self.name)
>    1134 
>    1135         for temp_arg in temp_args:
> /usr/local/share/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>      43     def deco(*a, **kw):
>      44         try:
> ---> 45             return f(*a, **kw)
>      46         except py4j.protocol.Py4JJavaError as e:
>      47             s = e.java_exception.toString()
> /usr/local/lib/python2.7/site-packages/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
>     325             raise Py4JError(
>     326                 "An error occurred while calling {0}{1}{2}".
> --> 327                 format(target_id, ".", name))
>     328     else:
>     329         type = answer[1]
> Py4JError: An error occurred while calling o47.stop
> {code}
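
The manual workaround mentioned in the description (resetting the SparkContext
internal attributes to None) could look roughly like the sketch below. It is
not the fix that was committed for this issue. The helper name
force_reset_context is hypothetical, and the sketch assumes the private
attributes _jsc, _active_spark_context, _gateway, _jvm and _lock that
pyspark/context.py uses; these are internals and may differ between Spark
versions. It catches a broad Exception only to avoid importing py4j here.

```python
import threading


def force_reset_context(sc, context_cls=None):
    """Best-effort cleanup of a SparkContext whose JVM process has died.

    `sc` is the broken context; `context_cls` defaults to type(sc), which
    would be pyspark.SparkContext in a real session (assumption).
    """
    context_cls = context_cls or type(sc)

    # Try the regular shutdown first. With a dead spark-submit process this
    # is expected to raise (Py4JError in the log above), so swallow it.
    try:
        sc.stop()
    except Exception as exc:
        print("stop() failed, resetting manually: %r" % (exc,))

    # Instance-level handle to the (now dead) Java context.
    sc._jsc = None

    # Class-level state that makes pyspark treat the context as a singleton.
    # Assumed private names; fall back to a fresh lock if _lock is missing.
    lock = getattr(context_cls, "_lock", None) or threading.RLock()
    with lock:
        context_cls._active_spark_context = None
        context_cls._gateway = None
        context_cls._jvm = None
```

In a notebook this would be called as force_reset_context(sc) before creating
a new SparkContext. The sketch does not shut down other helpers the old
context may have started (for example the accumulator server), so restarting
the notebook remains the safer option.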



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


