spark-issues mailing list archives

From "Tomas Pranckevicius (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-12261) pyspark crash for large dataset
Date Thu, 16 Mar 2017 09:53:41 GMT

    [ https://issues.apache.org/jira/browse/SPARK-12261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15927746#comment-15927746 ]

Tomas Pranckevicius edited comment on SPARK-12261 at 3/16/17 9:52 AM:
----------------------------------------------------------------------

I am also looking for a solution to this PySpark crash for the large-dataset issue on
Windows. I have read several posts and spent a few days on this problem. I am happy to see
that there is a solution mentioned by Shea Parkes, and I am trying to get it working by
changing rdd.py, but it still does not produce a positive outcome. Could you please write
more details on the change that has to be made for the proposed bandaid of exhausting the
iterator at the end of takeUpToNumLeft() in rdd.py?
            def takeUpToNumLeft(iterator):
                # The partition iterator is passed in by runJob;
                # re-wrap it so next() works on any iterable.
                iterator = iter(iterator)
                taken = 0
                while taken < left:
                    yield next(iterator)
                    taken += 1
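
For reference, this is my current attempt at the change, assuming (my reading, not a
confirmed patch) that the intent is to drain the rest of the partition after the take loop
so the Python worker reads the stream to the end instead of stopping early:

            def takeUpToNumLeft(iterator):
                iterator = iter(iterator)
                taken = 0
                while taken < left:
                    yield next(iterator)
                    taken += 1
                # Assumed bandaid: exhaust the remaining elements so the
                # JVM side does not hit "Connection reset by peer" when
                # the Python worker stops consuming the stream early.
                for _ in iterator:
                    pass

If that is not the intended edit, more details would be appreciated.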


> pyspark crash for large dataset
> -------------------------------
>
>                 Key: SPARK-12261
>                 URL: https://issues.apache.org/jira/browse/SPARK-12261
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.5.2
>         Environment: windows
>            Reporter: zihao
>
> I tried to import a local text file (over 100 MB) via textFile in PySpark; when I ran data.take(), it failed and gave error messages including:
> 15/12/10 17:17:43 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
> Traceback (most recent call last):
>   File "E:/spark_python/test3.py", line 9, in <module>
>     lines.take(5)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1299, in take
>     res = self.context.runJob(self, takeUpToNumLeft, p)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\context.py", line 916, in runJob
>     port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
>   File "C:\Anaconda2\lib\site-packages\py4j\java_gateway.py", line 813, in __call__
>     answer, self.gateway_client, self.target_id, self.name)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\sql\utils.py", line 36, in
deco
>     return f(*a, **kw)
>   File "C:\Anaconda2\lib\site-packages\py4j\protocol.py", line 308, in get_return_value
>     format(target_id, ".", name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.net.SocketException: Connection reset by peer: socket write error
> Then I ran the same code on a small text file, and this time .take() worked fine.
> How can I solve this problem?
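
For anyone trying to reproduce the report above, a minimal script matching that setup
might look like the sketch below. The path and input file are placeholders, and Spark
1.5.2 on Windows is assumed per the report; this is a reproduction sketch, not a fix.

    from pyspark import SparkContext

    # Placeholder app name; master "local" matches a single-machine run
    sc = SparkContext("local", "SPARK-12261-repro")
    # Placeholder path: any local text file over ~100 MB
    lines = sc.textFile("E:/spark_python/big_input.txt")
    # take() stops reading each partition early; on the reported setup
    # this is where the java.net.SocketException surfaces
    print(lines.take(5))
    sc.stop()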




