Return-Path:
X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io
Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io
Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 515D8200BCB for ; Thu, 24 Nov 2016 13:30:00 +0100 (CET)
Received: by cust-asf.ponee.io (Postfix) id 504AA160B1E; Thu, 24 Nov 2016 12:30:00 +0000 (UTC)
Delivered-To: archive-asf-public@cust-asf.ponee.io
Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 73096160B11 for ; Thu, 24 Nov 2016 13:29:59 +0100 (CET)
Received: (qmail 98760 invoked by uid 500); 24 Nov 2016 12:29:58 -0000
Mailing-List: contact issues-help@flink.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: dev@flink.apache.org
Delivered-To: mailing list issues@flink.apache.org
Received: (qmail 98750 invoked by uid 99); 24 Nov 2016 12:29:58 -0000
Received: from arcas.apache.org (HELO arcas) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 24 Nov 2016 12:29:58 +0000
Received: from arcas.apache.org (localhost [127.0.0.1]) by arcas (Postfix) with ESMTP id 7FEBA2C03DF for ; Thu, 24 Nov 2016 12:29:58 +0000 (UTC)
Date: Thu, 24 Nov 2016 12:29:58 +0000 (UTC)
From: "ASF GitHub Bot (JIRA)"
To: issues@flink.apache.org
Message-ID:
In-Reply-To:
References:
Subject: [jira] [Commented] (FLINK-4712) Implementing ranking predictions for ALS
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394
archived-at: Thu, 24 Nov 2016 12:30:00 -0000

    [ https://issues.apache.org/jira/browse/FLINK-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15693161#comment-15693161 ]

ASF GitHub Bot commented on FLINK-4712:
---------------------------------------

Github user gaborhermann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2838#discussion_r89489117

    --- Diff: flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/pipeline/Predictor.scala ---
    @@ -72,14 +77,142 @@ trait Predictor[Self] extends Estimator[Self] with WithParameters {
         */
       def evaluate[Testing, PredictionValue](
           testing: DataSet[Testing],
    -      evaluateParameters: ParameterMap = ParameterMap.Empty)(implicit
    -      evaluator: EvaluateDataSetOperation[Self, Testing, PredictionValue])
    +      evaluateParameters: ParameterMap = ParameterMap.Empty)
    +      (implicit evaluator: EvaluateDataSetOperation[Self, Testing, PredictionValue])
         : DataSet[(PredictionValue, PredictionValue)] = {
         FlinkMLTools.registerFlinkMLTypes(testing.getExecutionEnvironment)
         evaluator.evaluateDataSet(this, evaluateParameters, testing)
       }
     }

    +trait RankingPredictor[Self] extends Estimator[Self] with WithParameters {
    +  that: Self =>
    +
    +  def predictRankings(
    +    k: Int,
    +    users: DataSet[Int],
    +    predictParameters: ParameterMap = ParameterMap.Empty)(implicit
    +    rankingPredictOperation : RankingPredictOperation[Self])
    +  : DataSet[(Int,Int,Int)] =
    +    rankingPredictOperation.predictRankings(this, k, users, predictParameters)
    +
    +  def evaluateRankings(
    +    testing: DataSet[(Int,Int,Double)],
    +    evaluateParameters: ParameterMap = ParameterMap.Empty)(implicit
    +    rankingPredictOperation : RankingPredictOperation[Self])
    +  : DataSet[(Int,Int,Int)] = {
    +    // todo: do not burn 100 topK into code
    +    predictRankings(100, testing.map(_._1).distinct(), evaluateParameters)
    +  }
    +}
    +
    +trait RankingPredictOperation[Instance] {
    +  def predictRankings(
    +    instance: Instance,
    +    k: Int,
    +    users: DataSet[Int],
    +    predictParameters: ParameterMap = ParameterMap.Empty)
    +  : DataSet[(Int, Int, Int)]
    +}
    +
    +/**
    +  * Trait for providing auxiliary data for ranking evaluations.
    +  *
    +  * They are useful e.g. for excluding items found in the training [[DataSet]]
    +  * from the recommended top K items.
    +  */
    +trait TrainingRatingsProvider {
    +
    +  def getTrainingData: DataSet[(Int, Int, Double)]
    +
    +  /**
    +    * Retrieving the training items.
    +    * Although this can be calculated from the training data, it requires a costly
    +    * [[DataSet.distinct]] operation, while in matrix factor models the set items could be
    +    * given more efficiently from the item factors.
    +    */
    +  def getTrainingItems: DataSet[Int] = {
    +    getTrainingData.map(_._2).distinct()
    +  }
    +}
    +
    +/**
    +  * Ranking predictions for the most common case.
    +  * If we can predict ratings, we can compute top K lists by sorting the predicted ratings.
    +  */
    +class RankingFromRatingPredictOperation[Instance <: TrainingRatingsProvider]
    +(val ratingPredictor: PredictDataSetOperation[Instance, (Int, Int), (Int, Int, Double)])
    +  extends RankingPredictOperation[Instance] {
    +
    +  private def getUserItemPairs(users: DataSet[Int], items: DataSet[Int], exclude: DataSet[(Int, Int)])
    +  : DataSet[(Int, Int)] = {
    +    users.cross(items)
    --- End diff --

    You're right. Although there's not much we can do generally to avoid this, we might be able to optimize for matrix factorization. This solution works for *every* predictor that predicts ratings, and we currently use it in ALS ([here](https://github.com/apache/flink/pull/2838/files/45c98a97ef82d1012062dbcf6ade85a8d566062d#diff-80639a21b8fd166b5f7df5280cd609a9R467)). With a matrix factorization model *specifically*, we can avoid materializing all user-item pairs as tuples and compute the rankings more directly, which might be more efficient. So we could use a more specific `RankingPredictor` implementation in `ALS`. But even in that case, we still need to go through all the items for a particular user to calculate the top k items for that user.

    Also, this is only calculated for the users we'd like to give rankings to, e.g. in a testing scenario, for the users in the test data, who might be significantly fewer than the users in the training data.
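
A rough sketch of the "more direct" matrix factorization route described above, written as plain Scala over in-memory factor arrays rather than Flink `DataSet`s. Everything named here (`TopKSketch`, `dot`, `topK`, the `exclude` parameter) is hypothetical and not part of the PR or of FlinkML; it only illustrates that each user still scans every item once, while never holding more than k candidates and never materializing the full user-item cross product.

```scala
import scala.collection.mutable

// Hypothetical helper, not part of the PR: per-user top-K from factor vectors.
object TopKSketch {

  /** Dot product of two factor vectors of equal length. */
  def dot(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "factor vectors must have the same length")
    var sum = 0.0
    var i = 0
    while (i < a.length) { sum += a(i) * b(i); i += 1 }
    sum
  }

  /**
    * Returns (item, rank) pairs for one user, with rank 1 for the best item.
    * Every item is scored once, but at most k candidates are kept at a time.
    */
  def topK(
      userFactors: Array[Double],
      itemFactors: Iterable[(Int, Array[Double])],
      k: Int,
      exclude: Set[Int] = Set.empty): Seq[(Int, Int)] = {
    // Ordering under which the lowest score is the "largest" element,
    // so that dequeue() removes the weakest of the current candidates.
    val weakestFirst = Ordering.fromLessThan[(Double, Int)](_._1 > _._1)
    val heap = mutable.PriorityQueue.empty[(Double, Int)](weakestFirst)

    for ((item, factors) <- itemFactors if !exclude(item)) {
      heap.enqueue((dot(userFactors, factors), item))
      if (heap.size > k) heap.dequeue()
    }

    // Sort the surviving candidates by descending score and attach 1-based ranks.
    heap.toList
      .sortWith(_._1 > _._1)
      .zipWithIndex
      .map { case ((_, item), index) => (item, index + 1) }
  }
}
```

Calling `TopKSketch.topK(userFactors, itemFactors, k, exclude = seenItems)` once per requested user would yield the (item, rank) part of the `(user, item, rank)` triples; how this would map onto distributed user and item factor `DataSet`s is left open here.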

    I suggest keeping this anyway, as it is general. We might come up with a solution that's slightly more efficient in most cases for MF models. Should we put effort into working on it? What do you think?


> Implementing ranking predictions for ALS
> ----------------------------------------
>
>                 Key: FLINK-4712
>                 URL: https://issues.apache.org/jira/browse/FLINK-4712
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Domokos Miklós Kelen
>            Assignee: Gábor Hermann
>
> We started working on implementing ranking predictions for recommender systems. Ranking prediction means that besides predicting scores for user-item pairs, the recommender system is able to recommend a top K list for the users.
> Details:
> In practice, this would mean finding the K items for a particular user with the highest predicted rating. It should also be possible to specify whether to exclude the already seen items from a particular user's toplist. (See for example the 'exclude_known' setting of [Graphlab Create's ranking factorization recommender|https://turi.com/products/create/docs/generated/graphlab.recommender.ranking_factorization_recommender.RankingFactorizationRecommender.recommend.html#graphlab.recommender.ranking_factorization_recommender.RankingFactorizationRecommender.recommend] ).
> The output of the topK recommendation function could be in the form of {{DataSet[(Int,Int,Int)]}}, meaning (user, item, rank), similar to Graphlab Create's output. However, this is arguable: follow-up work includes implementing ranking recommendation evaluation metrics (such as precision@k, recall@k, ndcg@k), similar to [Spark's implementations|https://spark.apache.org/docs/1.5.0/mllib-evaluation-metrics.html#ranking-systems].
> It would be beneficial if we were able to design the API such that it could be included in the proposed evaluation framework (see [2157|https://issues.apache.org/jira/browse/FLINK-2157]), which makes it necessary to consider the possible output type {{DataSet[(Int, Array[Int])]}} or {{DataSet[(Int, Array[(Int,Double)])]}}, meaning (user, array of items), possibly including the predicted scores as well. See [4713|https://issues.apache.org/jira/browse/FLINK-4713] for details.
> Another question arising is whether to provide this function as a member of the ALS class, as a switch-kind of parameter to the ALS implementation (meaning the model is either a rating or a ranking recommender model) or in some other way.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)