spark-issues mailing list archives

From "Davies Liu (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-1065) PySpark runs out of memory with large broadcast variables
Date Wed, 13 Aug 2014 00:34:15 GMT

    [ https://issues.apache.org/jira/browse/SPARK-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14094947#comment-14094947 ]

Davies Liu edited comment on SPARK-1065 at 8/13/14 12:34 AM:
-------------------------------------------------------------

The broadcast variable was not used correctly in the code above; it should be used like this:

{code:python}
broadcast_vals = []
for _ in range(5):
    # Build 100,000 rows of 200 floats and broadcast them to the executors.
    datas = [[float(j) for j in range(200)] for j in range(100000)]
    val = sc.broadcast(datas)
    broadcast_vals.append(val)

sc.parallelize(range(80)).map(lambda x: sum(len(val.value) for val in broadcast_vals)).collect()
{code}
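
The important point is that the tasks read the data through val.value inside the closure: each executor then fetches the broadcast blocks once, instead of receiving a full serialized copy of the data with every task.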

Keeping a reference to the object in the Python driver is not necessary in most cases. We will make it optional (no reference kept by default), which will reduce the memory used in the Python driver.
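
To illustrate the idea, here is a minimal, hypothetical sketch (not the actual PySpark internals) of a broadcast wrapper that keeps no live driver-side reference: only the pickled bytes are retained, and the value is deserialized on demand.

{code:python}
import pickle

# Hypothetical sketch only -- LazyBroadcast is not a PySpark class.
class LazyBroadcast(object):
    def __init__(self, obj):
        # Keep only the pickled bytes; no live reference to obj
        # survives in the driver after construction.
        self._pickled = pickle.dumps(obj, 2)

    @property
    def value(self):
        # Deserialize on demand, e.g. on first access inside a worker.
        return pickle.loads(self._pickled)
{code}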



> PySpark runs out of memory with large broadcast variables
> ---------------------------------------------------------
>
>                 Key: SPARK-1065
>                 URL: https://issues.apache.org/jira/browse/SPARK-1065
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 0.7.3, 0.8.1, 0.9.0
>            Reporter: Josh Rosen
>            Assignee: Davies Liu
>
> PySpark's driver components may run out of memory when broadcasting large variables (say, 1 gigabyte).
> Because PySpark's broadcast is implemented on top of Java Spark's broadcast by broadcasting a pickled Python object as a byte array, we may be retaining multiple copies of the large object: a pickled copy in the JVM and a deserialized copy in the Python driver.
> The problem could also be due to memory requirements during pickling.
> PySpark is also affected by broadcast variables not being garbage collected. Adding an unpersist() method to broadcast variables may fix this: https://github.com/apache/incubator-spark/pull/543.
> As a first step toward fixing this, we should write a failing test that reproduces the error.
> This was discovered by [~sandy]: ["trouble with broadcast variables on pyspark"|http://apache-spark-user-list.1001560.n3.nabble.com/trouble-with-broadcast-variables-on-pyspark-tp1301.html].
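
For reference, a hedged sketch of how the unpersist() method proposed in that pull request could be used from PySpark, assuming it lands with this shape (sc is a SparkContext):

{code:python}
# Sketch only: assumes Broadcast.unpersist() exists as proposed in
# https://github.com/apache/incubator-spark/pull/543.
large = [[float(j) for j in range(200)] for j in range(100000)]
bv = sc.broadcast(large)
total = sc.parallelize(range(80)).map(lambda x: len(bv.value)).sum()
bv.unpersist()  # release the broadcast blocks once no task needs them
{code}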



--
This message was sent by Atlassian JIRA
(v6.2#6252)


