spark-issues mailing list archives

From "Nick Pentreath (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-13857) Feature parity for ALS ML with MLLIB
Date Tue, 12 Apr 2016 09:20:25 GMT

    [ https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236796#comment-15236796 ]

Nick Pentreath edited comment on SPARK-13857 at 4/12/16 9:19 AM:
-----------------------------------------------------------------

[~mengxr] [~josephkb]

In an ideal world, this is what a train-validation split with ALS and ranking evaluation
would look like:

{code}
// Prepare training and test data.
val ratings = ...
val Array(training, test) = ratings.randomSplit(Array(0.8, 0.2))

// set up ALS with top-k prediction
val als = new ALS()
  .setMaxIter(5)
  .setImplicitPrefs(true)
  .setK(10)
  .setTopKInputCol("user")
  .setTopKOutputCol("topk")

// build param grid
val paramGrid = new ParamGridBuilder()
  .addGrid(als.regParam, Array(0.01, 0.05, 0.1))
  .addGrid(als.alpha, Array(1.0, 10.0, 20.0))
  .build()

// ranking evaluator with appropriate prediction column
val evaluator = new RankingEvaluator()
  .setPredictionCol("topk")
  .setMetricName("mapk")
  .setK(10)
  .setLabelCol("actual")

val trainValidationSplit = new TrainValidationSplit()
  .setEstimator(als)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  // 80% of the data will be used for training and the remaining 20% for validation.
  .setTrainRatio(0.8)

// Run train validation split, and choose the best set of parameters.
val model = trainValidationSplit.fit(training)

// Make predictions on test data. model is the model with combination of parameters
// that performed best.
model.transform(test)
  .select("user", "actual", "topk")
  .show()
{code}

The issue is that the input dataset to {{fit}} for ALS is a DF of {{(userId, itemId, rating)}} rows.
The input to {{transform}} with the top-k option enabled is a DF of {{userId}} rows, while
the input to {{evaluate}} is a DF of {{(userId, actual)}} rows, where {{actual}} is an array
of ground truth item ids {{(id1, id2, ...)}}. Since the three stages expect differently shaped
inputs, it doesn't work out of the box.
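To make the shape mismatch concrete, here is a minimal sketch in plain Scala collections (no Spark, no DataFrames). Every name in it ({{RankingShapes}}, {{toActual}}, {{usersToScore}}, {{averagePrecisionAtK}}) is hypothetical and not part of any Spark API. It also assumes one common definition of MAP@k, where the per-user denominator is {{min(k, |actual|)}}; other definitions exist.

```scala
// Plain-Scala sketch only: all names are hypothetical, nothing here is Spark API.
object RankingShapes {
  // Shape consumed by fit: one row per (user, item, rating).
  final case class Rating(user: Int, item: Int, rating: Double)

  // Shape consumed by evaluate: one (user, actual) entry per user,
  // where actual holds that user's ground truth item ids.
  def toActual(ratings: Seq[Rating]): Map[Int, Seq[Int]] =
    ratings.groupBy(_.user).map { case (user, rs) => user -> rs.map(_.item).distinct }

  // Shape consumed by top-k transform: distinct user ids only.
  def usersToScore(ratings: Seq[Rating]): Seq[Int] =
    ratings.map(_.user).distinct

  // Average precision at k for one user (assumed definition:
  // denominator is min(k, |actual|); topk is assumed to hold distinct items).
  def averagePrecisionAtK(topk: Seq[Int], actual: Set[Int], k: Int): Double =
    if (actual.isEmpty || k <= 0) 0.0
    else {
      val (_, sum) = topk.take(k).zipWithIndex.foldLeft((0, 0.0)) {
        case ((hits, acc), (item, idx)) =>
          if (actual.contains(item)) (hits + 1, acc + (hits + 1).toDouble / (idx + 1))
          else (hits, acc)
      }
      sum / math.min(k, actual.size)
    }

  // Mean of the per-user values over all users that have ground truth.
  def meanAveragePrecisionAtK(
      topk: Map[Int, Seq[Int]],
      actual: Map[Int, Seq[Int]],
      k: Int): Double =
    if (actual.isEmpty) 0.0
    else actual.map { case (user, items) =>
      averagePrecisionAtK(topk.getOrElse(user, Seq.empty), items.toSet, k)
    }.sum / actual.size
}
```

The point of the sketch is that {{toActual}} and {{usersToScore}} are both derived from the same rating rows, so some component (evaluator, model, or tuning class) has to perform that reshaping before a ranking metric can be computed.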

I see three solutions:
# have {{RankingEvaluator}} and/or the cross-validation classes handle this in some generic
way (it would be good to understand how other ranking evaluation use cases could look in order
to also support them).
# have {{ALS}} handle it in {{transform}} - perhaps an option to output a {{topk}} column
and an {{actual}} column. This would require that the input DF to {{transform}} with the top-k
option is in the same form as for {{transform}} normally. It would require a distinct on the
{{userId}} column so that predictions are made only once per unique user id, and may be a bit
convoluted to make work.
# have specialized versions of {{TrainValidationSplit}} and {{CrossValidator}} to handle the
recommendation case.

#3 is not actually as crazy as it may sound, since the recommendation case is a little different
(the same might be the case for, say, learning-to-rank on queries etc), and even the way to
split the dataset into train and test sets is different (e.g. in recommender systems, data is
often sampled by user id, such as splitting a fraction of the ratings for each user into the
train and test sets).
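
As an illustration of the per-user split mentioned above, here is a rough sketch in plain Scala collections rather than DataFrames. The names {{PerUserSplit}} and {{split}} and the {{testFraction}} and {{seed}} parameters are hypothetical, and rounding the per-user test size down is just one possible choice.

```scala
import scala.util.Random

// Plain-Scala sketch only: all names are hypothetical, nothing here is Spark API.
object PerUserSplit {
  final case class Rating(user: Int, item: Int, rating: Double)

  // Holds out floor(n * testFraction) of each user's n ratings for the test set
  // and keeps the rest for training. Returns (train, test).
  def split(
      ratings: Seq[Rating],
      testFraction: Double,
      seed: Long): (Seq[Rating], Seq[Rating]) = {
    require(testFraction >= 0.0 && testFraction <= 1.0, "testFraction must be in [0, 1]")
    val rng = new Random(seed)
    // Sort users so the result is deterministic for a given seed.
    val perUser = ratings.groupBy(_.user).toSeq.sortBy(_._1).map { case (_, rs) =>
      val shuffled = rng.shuffle(rs)
      val nTest = math.floor(rs.size * testFraction).toInt
      val (test, train) = shuffled.splitAt(nTest)
      (train, test)
    }
    (perUser.flatMap(_._1), perUser.flatMap(_._2))
  }
}
```

Unlike a uniform {{randomSplit}} over rows, this keeps every user with at least one rating in the training set whenever {{testFraction}} is below 1, which is one reason a recommendation-specific splitter might be worth having.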



> Feature parity for ALS ML with MLLIB
> ------------------------------------
>
>                 Key: SPARK-13857
>                 URL: https://issues.apache.org/jira/browse/SPARK-13857
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>            Reporter: Nick Pentreath
>            Assignee: Nick Pentreath
>
> Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods {{recommendProducts/recommendUsers}}
> for recommending top K to a given user / item, as well as {{recommendProductsForUsers/recommendUsersForProducts}}
> to recommend top K across all users/items.
> Additionally, SPARK-10802 is for adding the ability to do {{recommendProductsForUsers}}
> for a subset of users (or vice versa).
> Look at exposing or porting (as appropriate) these methods to ALS in ML. 
> Investigate if efficiency can be improved at the same time (see SPARK-11968).
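
For readers unfamiliar with what "top K" means for a factorization model, the following is a naive sketch in plain Scala, assuming scores are dot products between user and item factor vectors. It is not the {{MatrixFactorizationModel}} implementation; {{TopKSketch}}, {{recommendForUser}}, {{recommendForSubset}} and the factor maps are hypothetical names used only for illustration.

```scala
// Plain-Scala sketch only: all names are hypothetical, nothing here is Spark API.
object TopKSketch {
  private def dot(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "factor vectors must have the same rank")
    a.indices.map(i => a(i) * b(i)).sum
  }

  // Top k (item, score) pairs for one user, highest score first.
  // Ties are broken by the smaller item id to keep the output deterministic.
  def recommendForUser(
      userFactors: Array[Double],
      itemFactors: Map[Int, Array[Double]],
      k: Int): Seq[(Int, Double)] =
    itemFactors.toSeq
      .map { case (item, factors) => item -> dot(userFactors, factors) }
      .sortWith((a, b) => a._2 > b._2 || (a._2 == b._2 && a._1 < b._1))
      .take(k)

  // Top k for a chosen subset of users; unknown user ids are skipped.
  def recommendForSubset(
      users: Seq[Int],
      userFactors: Map[Int, Array[Double]],
      itemFactors: Map[Int, Array[Double]],
      k: Int): Map[Int, Seq[(Int, Double)]] =
    users.distinct.flatMap { u =>
      userFactors.get(u).map(f => u -> recommendForUser(f, itemFactors, k))
    }.toMap
}
```

This brute-force version scores every item for every requested user, which is the cost that the efficiency investigation would aim to reduce.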



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

