Mailing-List: contact issues-help@flink.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@flink.apache.org
Date: Thu, 24 Nov 2016 12:29:58 +0000 (UTC)
From: "ASF GitHub Bot (JIRA)" <jira@apache.org>
To: issues@flink.apache.org
Message-ID: <JIRA.13008566.1475162145000.360361.1479990598520@Atlassian.JIRA>
In-Reply-To: <JIRA.13008566.1475162145000@Atlassian.JIRA>
References: <JIRA.13008566.1475162145000@Atlassian.JIRA> <JIRA.13008566.1475162145961@arcas>
Subject: [jira] [Commented] (FLINK-4712) Implementing ranking predictions
 for ALS
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
archived-at: Thu, 24 Nov 2016 12:30:00 -0000


    [ https://issues.apache.org/jira/browse/FLINK-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15693161#comment-15693161 ]

ASF GitHub Bot commented on FLINK-4712:
---------------------------------------

Github user gaborhermann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2838#discussion_r89489117

    --- Diff: flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/pipeline/Predictor.scala ---
    @@ -72,14 +77,142 @@ trait Predictor[Self] extends Estimator[Self] with WithParameters {
         */
       def evaluate[Testing, PredictionValue](
           testing: DataSet[Testing],
    -      evaluateParameters: ParameterMap = ParameterMap.Empty)(implicit
    -      evaluator: EvaluateDataSetOperation[Self, Testing, PredictionValue])
    +      evaluateParameters: ParameterMap = ParameterMap.Empty)
    +      (implicit evaluator: EvaluateDataSetOperation[Self, Testing, PredictionValue])
         : DataSet[(PredictionValue, PredictionValue)] = {
         FlinkMLTools.registerFlinkMLTypes(testing.getExecutionEnvironment)
         evaluator.evaluateDataSet(this, evaluateParameters, testing)
       }
     }

    +trait RankingPredictor[Self] extends Estimator[Self] with WithParameters {
    +  that: Self =>
    +
    +  def predictRankings(
    +    k: Int,
    +    users: DataSet[Int],
    +    predictParameters: ParameterMap = ParameterMap.Empty)(implicit
    +    rankingPredictOperation : RankingPredictOperation[Self])
    +  : DataSet[(Int,Int,Int)] =
    +    rankingPredictOperation.predictRankings(this, k, users, predictParameters)
    +
    +  def evaluateRankings(
    +    testing: DataSet[(Int,Int,Double)],
    +    evaluateParameters: ParameterMap = ParameterMap.Empty)(implicit
    +    rankingPredictOperation : RankingPredictOperation[Self])
    +  : DataSet[(Int,Int,Int)] = {
    +    // todo: do not burn 100 topK into code
    +    predictRankings(100, testing.map(_._1).distinct(), evaluateParameters)
    +  }
    +}
    +
    +trait RankingPredictOperation[Instance] {
    +  def predictRankings(
    +    instance: Instance,
    +    k: Int,
    +    users: DataSet[Int],
    +    predictParameters: ParameterMap = ParameterMap.Empty)
    +  : DataSet[(Int, Int, Int)]
    +}
    +
    +/**
    +  * Trait for providing auxiliary data for ranking evaluations.
    +  *
    +  * They are useful e.g. for excluding items found in the training [[DataSet]]
    +  * from the recommended top K items.
    +  */
    +trait TrainingRatingsProvider {
    +
    +  def getTrainingData: DataSet[(Int, Int, Double)]
    +
    +  /**
    +    * Retrieving the training items.
    +    * Although this can be calculated from the training data, it requires a costly
    +    * [[DataSet.distinct]] operation, while in matrix factor models the set items could be
    +    * given more efficiently from the item factors.
    +    */
    +  def getTrainingItems: DataSet[Int] = {
    +    getTrainingData.map(_._2).distinct()
    +  }
    +}
    +
    +/**
    +  * Ranking predictions for the most common case.
    +  * If we can predict ratings, we can compute top K lists by sorting the predicted ratings.
    +  */
    +class RankingFromRatingPredictOperation[Instance <: TrainingRatingsProvider]
    +(val ratingPredictor: PredictDataSetOperation[Instance, (Int, Int), (Int, Int, Double)])
    +  extends RankingPredictOperation[Instance] {
    +
    +  private def getUserItemPairs(users: DataSet[Int], items: DataSet[Int], exclude: DataSet[(Int, Int)])
    +  : DataSet[(Int, Int)] = {
    +    users.cross(items)
    --- End diff --

    You're right. Although there is not much we can do in general to avoid this, we might be able to optimize for matrix factorization. This solution works for *every* predictor that predicts ratings, and we currently use it in ALS ([here](https://github.com/apache/flink/pull/2838/files/45c98a97ef82d1012062dbcf6ade85a8d566062d#diff-80639a21b8fd166b5f7df5280cd609a9R467)). With a matrix factorization model *specifically*, we can avoid materializing all user-item pairs as tuples and compute the rankings more directly, which might be more efficient. So we could use a more specific `RankingPredictor` implementation in `ALS`. But even in that case, we still need to go through all the items for a particular user to calculate the top k items for that user.
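
    For illustration, the general approach can be sketched with plain Scala collections standing in for Flink `DataSet`s. This is only a minimal sketch under stated assumptions, not the code from the pull request: `GeneralTopK`, `RatingPredictor` and the in-memory `Seq`/`Set` arguments are hypothetical names chosen for this example.

```scala
object GeneralTopK {
  // Hypothetical stand-in for any rating predictor: (user, item) => predicted rating.
  type RatingPredictor = (Int, Int) => Double

  /** Returns (user, item, rank) triples; rank starts at 1 for the best item. */
  def predictRankings(
      k: Int,
      users: Seq[Int],
      items: Seq[Int],
      exclude: Set[(Int, Int)],
      predict: RatingPredictor): Seq[(Int, Int, Int)] = {
    // Materialize every candidate (user, item) pair: this is the costly cross product.
    val pairs = for (u <- users; i <- items if !exclude((u, i))) yield (u, i)

    pairs
      .map { case (u, i) => (u, i, predict(u, i)) }
      .groupBy(_._1)
      .toSeq
      .flatMap { case (_, rated) =>
        // Sort each user's predictions by descending rating and keep the best k.
        rated
          .sortWith(_._3 > _._3)
          .take(k)
          .zipWithIndex
          .map { case ((u, i, _), idx) => (u, i, idx + 1) }
      }
  }
}
```

    The order of users in the result is unspecified in this sketch; only the ranks within each user are meaningful.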

    Also, this is only calculated for the users we'd like to give rankings to, e.g. in a testing scenario, for the users in the test data, which might be significantly fewer than the users in the training data.

    I suggest keeping this anyway, as it is general. We might come up with a solution that's slightly more efficient in most cases for MF models. Should we put effort into working on it? What do you think?


> Implementing ranking predictions for ALS
> ----------------------------------------
>
>                 Key: FLINK-4712
>                 URL: https://issues.apache.org/jira/browse/FLINK-4712
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Domokos Miklós Kelen
>            Assignee: Gábor Hermann
>
> We started working on implementing ranking predictions for recommender systems. Ranking prediction means that besides predicting scores for user-item pairs, the recommender system is able to recommend a top K list for the users.
> Details:
> In practice, this would mean finding the K items with the highest predicted rating for a particular user. It should also be possible to specify whether to exclude the already seen items from a particular user's toplist. (See for example the 'exclude_known' setting of [Graphlab Create's ranking factorization recommender|https://turi.com/products/create/docs/generated/graphlab.recommender.ranking_factorization_recommender.RankingFactorizationRecommender.recommend.html#graphlab.recommender.ranking_factorization_recommender.RankingFactorizationRecommender.recommend] ).
> The output of the topK recommendation function could be in the form of {{DataSet[(Int,Int,Int)]}}, meaning (user, item, rank), similar to Graphlab Create's output. However, this is arguable: follow-up work includes implementing ranking recommendation evaluation metrics (such as precision@k, recall@k, ndcg@k), similar to [Spark's implementations|https://spark.apache.org/docs/1.5.0/mllib-evaluation-metrics.html#ranking-systems]. It would be beneficial if we were able to design the API such that it could be included in the proposed evaluation framework (see [2157|https://issues.apache.org/jira/browse/FLINK-2157]), which makes it necessary to consider the possible output type {{DataSet[(Int, Array[Int])]}} or {{DataSet[(Int, Array[(Int,Double)])]}}, meaning (user, array of items), possibly including the predicted scores as well. See [4713|https://issues.apache.org/jira/browse/FLINK-4713] for details.
> Another open question is whether to provide this function as a member of the ALS class, as a switch-like parameter of the ALS implementation (meaning the model is either a rating or a ranking recommender model), or in some other way.


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)