spark-issues mailing list archives

From "Apache Spark (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (SPARK-18274) Memory leak in PySpark StringIndexer
Date Thu, 10 Nov 2016 16:25:59 GMT

     [ https://issues.apache.org/jira/browse/SPARK-18274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-18274:
------------------------------------

    Assignee: Apache Spark

> Memory leak in PySpark StringIndexer
> ------------------------------------
>
>                 Key: SPARK-18274
>                 URL: https://issues.apache.org/jira/browse/SPARK-18274
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.0.1
>            Reporter: Jonas Amrich
>            Assignee: Apache Spark
>
> StringIndexerModel won't get collected by the Java GC even after it is deleted in Python.
> This can be reproduced with the following code, which fails after a couple of iterations
> (around 7 if you set driver memory to 600MB):
> {code}
> import random, string
> from pyspark.ml.feature import StringIndexer
> # 700000 random strings of 10 characters
> l = [(''.join(random.choice(string.ascii_uppercase) for _ in range(10)), ) for _ in range(int(7e5))]
> df = spark.createDataFrame(l, ['string'])
> for i in range(50):
>     indexer = StringIndexer(inputCol='string', outputCol='index')
>     indexer.fit(df)
> {code}
> An explicit call to the Python GC fixes the issue; the following code runs fine:
> {code}
> import gc
> for i in range(50):
>     indexer = StringIndexer(inputCol='string', outputCol='index')
>     indexer.fit(df)
>     gc.collect()
> {code}
> The issue is similar to SPARK-6194 and can probably be fixed by calling JVM detach in the
> model's destructor. This is implemented in pyspark.mllib.common.JavaModelWrapper but missing
> in pyspark.ml.wrapper.JavaWrapper. Other models in the ml package may also be affected by this
> memory leak.
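
The destructor-based fix suggested above can be illustrated with a minimal, Spark-free sketch. It is not the actual PySpark or Py4J code: `FakeGateway` and `ModelWrapper` are hypothetical stand-ins, and the integer ids stand in for JVM object handles. The only point shown is the pattern of detaching the JVM-side reference from the wrapper's `__del__`.

```python
import gc


class FakeGateway:
    """Hypothetical stand-in for a Py4J gateway; tracks which JVM objects are still referenced."""

    def __init__(self):
        self.attached = set()

    def attach(self, java_obj_id):
        self.attached.add(java_obj_id)

    def detach(self, java_obj_id):
        # Assumption: detaching lets the JVM garbage-collect the object.
        self.attached.discard(java_obj_id)


class ModelWrapper:
    """Hypothetical Python-side wrapper around a JVM model object."""

    def __init__(self, gateway, java_obj_id):
        self._gateway = gateway
        self._java_obj_id = java_obj_id
        gateway.attach(java_obj_id)

    def __del__(self):
        # Release the JVM-side reference when the Python wrapper is destroyed.
        self._gateway.detach(self._java_obj_id)


gateway = FakeGateway()
for i in range(50):
    # Rebinding `model` drops the previous wrapper, which triggers its __del__.
    model = ModelWrapper(gateway, java_obj_id=i)

del model
gc.collect()  # only needed on interpreters without immediate reference counting
remaining = len(gateway.attached)
```

With the destructor in place, no JVM-side references remain once the Python wrappers are gone; without it, all 50 ids would stay in `gateway.attached`, which mirrors the leak described in this issue.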



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

