spark-issues mailing list archives

From "Liang-Chi Hsieh (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-24447) Pyspark RowMatrix.columnSimilarities() loses spark context
Date Wed, 06 Jun 2018 06:47:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-24447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16502876#comment-16502876 ]

Liang-Chi Hsieh commented on SPARK-24447:
-----------------------------------------

I can't reproduce this on the current master branch. Can you try it there as well?

> Pyspark RowMatrix.columnSimilarities() loses spark context
> ----------------------------------------------------------
>
>                 Key: SPARK-24447
>                 URL: https://issues.apache.org/jira/browse/SPARK-24447
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib, PySpark
>    Affects Versions: 2.3.0
>            Reporter: Perry Chu
>            Priority: Major
>
> The RDD behind the CoordinateMatrix returned by RowMatrix.columnSimilarities() appears to be losing track of the SparkContext.
> I'm pretty new to Spark - not sure whether the problem is on the Python side or the Scala side - would appreciate someone more experienced taking a look.
> This snippet should reproduce the error:
> {code:java}
> from pyspark.mllib.linalg.distributed import RowMatrix
> rows = spark.sparkContext.parallelize([[0,1,2],[1,1,1]])
> matrix = RowMatrix(rows)
> sims = matrix.columnSimilarities()
> ## This works, prints "3 3" as expected (3 columns = 3x3 matrix)
> print(sims.numRows(),sims.numCols())
> ## This throws an error (stack trace below)
> print(sims.entries.first())
> ## Later I tried this
> print(rows.context) #<SparkContext master=yarn appName=Spark ML Pipeline>
> print(sims.entries.context) #<SparkContext master=yarn appName=PySparkShell>, then throws an error
> {code}
> Error stack trace
> {code:java}
> ---------------------------------------------------------------------------
> AttributeError Traceback (most recent call last)
> <ipython-input-47-50f83a6cf449> in <module>()
> ----> 1 sims.entries.first()
> /usr/lib/spark/python/pyspark/rdd.py in first(self)
> 1374 ValueError: RDD is empty
> 1375 """
> -> 1376 rs = self.take(1)
> 1377 if rs:
> 1378 return rs[0]
> /usr/lib/spark/python/pyspark/rdd.py in take(self, num)
> 1356
> 1357 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
> -> 1358 res = self.context.runJob(self, takeUpToNumLeft, p)
> 1359
> 1360 items += res
> /usr/lib/spark/python/pyspark/context.py in runJob(self, rdd, partitionFunc, partitions, allowLocal)
> 999 # SparkContext#runJob.
> 1000 mappedRDD = rdd.mapPartitions(partitionFunc)
> -> 1001 port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
> 1002 return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
> 1003
> AttributeError: 'NoneType' object has no attribute 'sc'
> {code}
> PySpark columnSimilarities documentation
> http://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/linalg/distributed.html#RowMatrix.columnSimilarities
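The final `AttributeError` in the trace, together with the two different `appName`s printed by the snippet, suggests that `sims.entries` ends up bound to a SparkContext whose JVM handle (`_jsc`) is `None` rather than to the live context that created `rows`. A minimal pure-Python sketch of that failure mode (the `FakeContext` class and `run_job` method here are illustrative stand-ins, not PySpark internals; only the `_jsc` attribute name and the `self._jsc.sc()` call are taken from the stack trace):

```python
class FakeContext:
    """Stand-in for a SparkContext whose JVM gateway handle is gone.

    `_jsc` mirrors the attribute named in the stack trace; in a stale or
    phantom context it is None.
    """
    def __init__(self, jsc):
        self._jsc = jsc

    def run_job(self):
        # Mirrors the failing line in pyspark/context.py:
        #   port = self._jvm.PythonRDD.runJob(self._jsc.sc(), ...)
        return self._jsc.sc()

stale = FakeContext(jsc=None)  # like the context behind sims.entries

try:
    stale.run_job()
except AttributeError as e:
    print(e)  # 'NoneType' object has no attribute 'sc'
```

This is why `numRows()`/`numCols()` can still succeed (they are answered on the JVM side via the matrix wrapper) while any action on `sims.entries`, such as `first()`, fails as soon as PySpark tries to launch a job through the dead context.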



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

