spark-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From m...@apache.org
Subject git commit: SPARK-3770: Make userFeatures accessible from python
Date Tue, 21 Oct 2014 18:49:46 GMT
Repository: spark
Updated Branches:
  refs/heads/master 61ca7742d -> 1a623b2e1


SPARK-3770: Make userFeatures accessible from python

https://issues.apache.org/jira/browse/SPARK-3770

We need access to the underlying latent user features from python. However, the userFeatures
RDD from the MatrixFactorizationModel isn't accessible from the python bindings. I've added
a method to the underlying scala class to turn the RDD[(Int, Array[Double])] to an RDD[String].
This is then accessed from the python recommendation.py

Author: Michelangelo D'Agostino <mdagostino@civisanalytics.com>

Closes #2636 from mdagost/mf_user_features and squashes the following commits:

c98f9e2 [Michelangelo D'Agostino] Added unit tests for userFeatures and productFeatures and
merged master.
d5eadf8 [Michelangelo D'Agostino] Merge branch 'master' into mf_user_features
2481a2a [Michelangelo D'Agostino] Merged master and resolved conflict.
a6ffb96 [Michelangelo D'Agostino] Eliminated a function from our first approach to this problem
that is no longer needed now that we added the fromTuple2RDD function.
2aa1bf8 [Michelangelo D'Agostino] Implemented a function called fromTuple2RDD in PythonMLLibAPI
and used it to expose the MF userFeatures and productFeatures in python.
34cb2a2 [Michelangelo D'Agostino] A couple of lint cleanups and a comment.
cdd98e3 [Michelangelo D'Agostino] It's working now.
e1fbe5e [Michelangelo D'Agostino] Added scala function to stringify userFeatures for access
in python.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1a623b2e
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1a623b2e
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1a623b2e

Branch: refs/heads/master
Commit: 1a623b2e163da3a9112cb9b68bda22b9e398ed5c
Parents: 61ca774
Author: Michelangelo D'Agostino <mdagostino@civisanalytics.com>
Authored: Tue Oct 21 11:49:39 2014 -0700
Committer: Xiangrui Meng <meng@databricks.com>
Committed: Tue Oct 21 11:49:39 2014 -0700

----------------------------------------------------------------------
 .../spark/mllib/api/python/PythonMLLibAPI.scala |  5 ++++
 python/pyspark/mllib/recommendation.py          | 31 ++++++++++++++++++++
 2 files changed, 36 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/1a623b2e/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
----------------------------------------------------------------------
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
index 9a10017..b478c21 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
@@ -673,6 +673,11 @@ private[spark] object SerDe extends Serializable {
     rdd.map(x => (x(0).asInstanceOf[Int], x(1).asInstanceOf[Int]))
   }
 
+  /* convert RDD[Tuple2[,]] to RDD[Array[Any]] */
+  def fromTuple2RDD(rdd: RDD[Tuple2[Any, Any]]): RDD[Array[Any]]  = {
+    rdd.map(x => Array(x._1, x._2))
+  }
+
   /**
    * Convert an RDD of Java objects to an RDD of serialized Python objects, that is usable
by
    * PySpark.

http://git-wip-us.apache.org/repos/asf/spark/blob/1a623b2e/python/pyspark/mllib/recommendation.py
----------------------------------------------------------------------
diff --git a/python/pyspark/mllib/recommendation.py b/python/pyspark/mllib/recommendation.py
index 17f96b8..22872db 100644
--- a/python/pyspark/mllib/recommendation.py
+++ b/python/pyspark/mllib/recommendation.py
@@ -53,6 +53,23 @@ class MatrixFactorizationModel(object):
     >>> model = ALS.train(ratings, 1)
     >>> model.predictAll(testset).count() == 2
     True
+
+    >>> model = ALS.train(ratings, 4)
+    >>> model.userFeatures().count() == 2
+    True
+
+    >>> first_user = model.userFeatures().take(1)[0]
+    >>> latents = first_user[1]
+    >>> len(latents) == 4
+    True
+
+    >>> model.productFeatures().count() == 2
+    True
+
+    >>> first_product = model.productFeatures().take(1)[0]
+    >>> latents = first_product[1]
+    >>> len(latents) == 4
+    True
     """
 
     def __init__(self, sc, java_model):
@@ -83,6 +100,20 @@ class MatrixFactorizationModel(object):
         return RDD(sc._jvm.SerDe.javaToPython(jresult), sc,
                    AutoBatchedSerializer(PickleSerializer()))
 
+    def userFeatures(self):
+        sc = self._context
+        juf = self._java_model.userFeatures()
+        juf = sc._jvm.SerDe.fromTuple2RDD(juf).toJavaRDD()
+        return RDD(sc._jvm.PythonRDD.javaToPython(juf), sc,
+                   AutoBatchedSerializer(PickleSerializer()))
+
+    def productFeatures(self):
+        sc = self._context
+        jpf = self._java_model.productFeatures()
+        jpf = sc._jvm.SerDe.fromTuple2RDD(jpf).toJavaRDD()
+        return RDD(sc._jvm.PythonRDD.javaToPython(jpf), sc,
+                   AutoBatchedSerializer(PickleSerializer()))
+
 
 class ALS(object):
 


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org


Mime
View raw message