spark-issues mailing list archives

From "Pablo J. Villacorta (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-26162) ALS results vary with user or item ID encodings
Date Sat, 24 Nov 2018 21:40:00 GMT
Pablo J. Villacorta created SPARK-26162:
-------------------------------------------

             Summary: ALS results vary with user or item ID encodings
                 Key: SPARK-26162
                 URL: https://issues.apache.org/jira/browse/SPARK-26162
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 2.3.0
            Reporter: Pablo J. Villacorta


When calling ALS.fit() with the same seed on the same ratings, the results (both the latent
factor matrices and the accuracy of the recommendations) differ when we change only the labels
used to encode the users or items. The code example below illustrates this by changing user ID
1 or 2 to an unused ID, 30. The user factor matrix changes, and not only in the rows corresponding
to users 1 or 2: the rows of the other users change as well.

Is this the intended behaviour?
{code:java}
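import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.functions.{col, collect_list, max, when}
import spark.implicits._ // needed for toDF; already in scope in spark-shell

val r = scala.util.Random
r.setSeed(123456)
// 1000 random ratings: users range from 0 to 19, items from 0 to 99, ratings from 1 to 4
val trainDataset1 = spark.sparkContext.parallelize(
    (1 to 1000).map(_ => (r.nextInt(20), r.nextInt(100), r.nextInt(4) + 1))
).toDF("user", "item", "rating")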

val maxuser = trainDataset1.select(max("user")).head.getAs[Int](0)
println(s"maxuser is ${maxuser}")

val trainDataset2 = trainDataset1.withColumn("user", when(col("user")===1, 30).otherwise(col("user")))
val trainDataset3 = trainDataset1.withColumn("user", when(col("user")===2, 30).otherwise(col("user")))

val testDatasets = Array(trainDataset1, trainDataset2, trainDataset3).map(
    _.groupBy("user").agg(collect_list("item").alias("watched"))
)

val Array(als1, als2, als3) = Array(trainDataset1, trainDataset2, trainDataset3).map(new ALS().setSeed(12345).fit(_))

als1.userFactors.show(5, false)
als2.userFactors.show(5, false)
als3.userFactors.show(5, false){code}
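Comparing the show() output by eye only covers a few rows. As a rough sketch, the difference could be quantified per user with a small helper like the one below. The name maxAbsDiffPerId is hypothetical (it is not part of Spark), and the toy factor values are made up purely to show the shape of the output.

```scala
// Hypothetical helper, not part of Spark: for every ID present in both models,
// report the largest absolute difference between the two factor vectors.
def maxAbsDiffPerId(
    a: Map[Int, Array[Float]],
    b: Map[Int, Array[Float]]): Map[Int, Float] =
  (a.keySet intersect b.keySet).map { id =>
    val diffs = a(id).zip(b(id)).map { case (x, y) => math.abs(x - y) }
    id -> diffs.max
  }.toMap

// Toy factors standing in for two fitted models (made-up numbers).
val factorsA = Map(0 -> Array(0.5f, 1.0f), 3 -> Array(0.25f, 0.75f))
val factorsB = Map(0 -> Array(0.5f, 1.0f), 3 -> Array(0.25f, 0.5f), 30 -> Array(0.1f, 0.2f))

// IDs that exist in only one of the two maps (here, 30) are skipped.
val diffs = maxAbsDiffPerId(factorsA, factorsB)
println(diffs)
```

With the models fitted above, each map could presumably be built with als1.userFactors.collect().map(r => r.getInt(0) -> r.getSeq[Float](1).toArray).toMap. A nonzero entry for a user whose ID was not relabelled would show the effect described in this report.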
If we ask for recommendations and compare them with a test dataset relabelled in the same way
(in this example, the test dataset is exactly the train dataset), the results also differ:
{code:java}
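import org.apache.spark.mllib.evaluation.RankingMetrics
import org.apache.spark.sql.Row
import scala.collection.mutable.WrappedArray

// For each model, keep only the item IDs of the top-20 recommendations per user
val recommendations = Array(als1, als2, als3).map(x =>
    x.recommendForAllUsers(20).map{
        case Row(user: Int, recommendations: WrappedArray[Row]) => {
            val items = recommendations.map{case Row(item: Int, score: Float) => item}
            (user, items)
        }
    }.toDF("user", "recommendations")
)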

// Pair each user's recommended items with the items that user actually rated
val predictionsAndActualRDD = testDatasets.zip(recommendations).map{
    case (testDataset, recommendationsDF) =>
        testDataset.join(recommendationsDF, "user")
            .rdd.map(r => (
                r.getAs[WrappedArray[Int]](r.fieldIndex("recommendations")).array,
                r.getAs[WrappedArray[Int]](r.fieldIndex("watched")).array
            ))
}

val metrics = predictionsAndActualRDD.map(new RankingMetrics(_))

println(s"Precision at 5 of first model = ${metrics(0).precisionAt(5)}")
println(s"Precision at 5 of second model = ${metrics(1).precisionAt(5)}")
println(s"Precision at 5 of third model = ${metrics(2).precisionAt(5)}")
{code}
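For reference, the metric compared above can be sketched for a single user as follows. This is a minimal illustration assuming the usual definition of precision at k (hits among the top k recommendations, divided by k). It is not Spark's own code: precisionAtK is a hypothetical helper, and the item IDs are made up. RankingMetrics reports the average of this quantity over all users.

```scala
// Hypothetical helper: fraction of the top-k recommended items
// that appear in the list of items the user actually rated.
def precisionAtK(recommended: Seq[Int], watched: Seq[Int], k: Int): Double = {
  require(k > 0, "k must be positive")
  val watchedSet = watched.toSet
  recommended.take(k).count(watchedSet.contains).toDouble / k
}

// Made-up example: items 3 and 4 are hits, so 2 of the top 5 are relevant.
val p = precisionAtK(Seq(7, 3, 9, 1, 4), Seq(3, 4, 8), k = 5)
println(p)
```

Because the metric depends only on which items land in the top k, any change in the learned factors that reorders a user's recommendations can change it, even when the underlying ratings are identical.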



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

