flink-issues mailing list archives

From Gábor Hermann (JIRA) <j...@apache.org>
Subject [jira] [Commented] (FLINK-4713) Implementing ranking evaluation scores for recommender systems
Date Wed, 02 Nov 2016 14:20:58 GMT

    [ https://issues.apache.org/jira/browse/FLINK-4713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15629109#comment-15629109 ]

Gábor Hermann commented on FLINK-4713:

We have managed to rework the evaluation framework proposed by Theodore so that ranking predictions fit in. Our approach is to use separate {{RankingPredictor}} and {{Predictor}} traits.
One main problem remains, however: there is no common superclass for {{RankingPredictor}} and {{Predictor}}, so the pipelining mechanism might not work. A {{Predictor}} can only be at the end of the pipeline, so this should not really be a problem, but I do not know for sure. An alternative solution would be to have different objects {{ALS}} and {{RankingALS}} that give different predictions, but both extend only {{Predictor}}; there could be implicit conversions between the two. I would prefer the current solution if it does not break the pipelining.
[~tvas] What do you think about this?

(This seems to be a problem similar to having a {{predict_proba}} function in scikit-learn classification models, where the same model for the same input gives two different predictions: {{predict}} for discrete predictions and {{predict_proba}} for a probability.)
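
For illustration, here is a minimal, self-contained sketch of the separate-traits approach. The method signatures are simplified assumptions; only the names {{Predictor}}, {{RankingPredictor}} and {{ALS}} come from the discussion above.

{code:scala}
// Simplified, hypothetical sketch: two separate traits with no common supertrait.
import org.apache.flink.api.scala._

// Rating-style prediction: (user, item) => (user, item, predicted rating)
trait Predictor {
  def predict(test: DataSet[(Int, Int)]): DataSet[(Int, Int, Double)]
}

// Ranking prediction: for each user, the top-k items as (user, item, rank)
trait RankingPredictor {
  def predictRankings(k: Int, users: DataSet[Int]): DataSet[(Int, Int, Int)]
}

// ALS could mix in both traits, so the same fitted model can give either kind
// of prediction. The open question is pipelining: only Predictor is known to
// the pipeline machinery, and the two traits share no common superclass.
class ALS extends Predictor with RankingPredictor {
  def predict(test: DataSet[(Int, Int)]): DataSet[(Int, Int, Double)] = ???
  def predictRankings(k: Int, users: DataSet[Int]): DataSet[(Int, Int, Int)] = ???
}
{code}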

On the other hand, we seem to have solved the scoring issue. Users can evaluate a recommendation algorithm such as ALS either with a score operating on rankings (e.g. nDCG) or with a score operating on ratings (e.g. RMSE). They only need to modify the {{Score}} they use in their code, and nothing else.
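
As a usage sketch (the {{evaluate}} call and the score names below are assumptions about the API, not its final form), switching from a rating-based to a ranking-based evaluation would only change the score argument:

{code:scala}
// Hypothetical usage: only the score changes between the two evaluations.
// evaluate(), RMSE() and NDCG() are illustrative names, not the final API.
val trainingRatings: DataSet[(Int, Int, Double)] = ???  // assumed input
val testRatings: DataSet[(Int, Int, Double)] = ???      // assumed input

val als = ALS()
  .setIterations(10)
  .setNumFactors(10)
als.fit(trainingRatings)

val rmse = als.evaluate(testRatings, RMSE())    // a PairwiseScore on ratings
val ndcg = als.evaluate(testRatings, NDCG(10))  // a RankingScore on top-10 rankings
{code}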

The main problem was that the {{evaluate}} method and {{EvaluateDataSetOperation}} were not general enough. They reduce the evaluation input to {{(trueValue, predictedValue)}} pairs (i.e. a {{DataSet\[(PredictionType, PredictionType)\]}}), while ranking evaluations need a more general input: the true ratings ({{DataSet\[(Int, Int, Double)\]}}) and the predicted rankings ({{DataSet\[(Int, Int, Int)\]}}).
Instead of using {{EvaluateDataSetOperation}}, we use a more general {{PrepareOperation}}. We rename the {{Score}} in the original evaluation framework to {{PairwiseScore}}; {{RankingScore}} and {{PairwiseScore}} have a common trait {{AbstractScore}}. This way the user can use both a {{RankingScore}} and a {{PairwiseScore}} for a certain model, and only needs to alter the score used in the code.
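
A minimal sketch of that hierarchy (the {{evaluate}} signatures are assumptions; the trait names and input shapes are the ones described here):

{code:scala}
import org.apache.flink.api.scala._

// Common supertrait of both score kinds.
trait AbstractScore

// Operates on (trueValue, predictedValue) pairs, e.g. RMSE.
trait PairwiseScore[PredictionType] extends AbstractScore {
  def evaluate(pairs: DataSet[(PredictionType, PredictionType)]): DataSet[Double]
}

// Operates on the true ratings and the predicted rankings, e.g. nDCG.
trait RankingScore extends AbstractScore {
  def evaluate(
      ratings: DataSet[(Int, Int, Double)],  // (user, item, relevance)
      rankings: DataSet[(Int, Int, Int)]     // (user, item, rank)
  ): DataSet[Double]
}
{code}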

In the case of pairwise scores (which only need true and predicted value pairs for evaluation), {{EvaluateDataSetOperation}} is used as a {{PrepareOperation}}. It prepares the evaluation by creating {{(trueValue, predictedValue)}} pairs from the test dataset. Thus, the result of preparing, and the input of {{PairwiseScore}}s, will be a {{DataSet\[(PredictionType, PredictionType)\]}}. In the case of rankings, the {{PrepareOperation}} passes the test dataset through and creates the rankings. The result of preparing, and the input of {{RankingScore}}s, will be a {{(DataSet\[(Int, Int, Double)\], DataSet\[(Int, Int, Int)\])}} pair. I believe this is a fairly acceptable solution that avoids breaking the API.
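
For illustration, a self-contained sketch of the two {{PrepareOperation}} flavours; the trait shape and the prediction functions passed in are assumptions made for the example, not the actual implementation.

{code:scala}
import org.apache.flink.api.scala._

// Hypothetical: turns a test set into whatever the corresponding score consumes.
trait PrepareOperation[Testing, Prepared] {
  def prepare(test: DataSet[Testing]): Prepared
}

// Pairwise case: plays the role of EvaluateDataSetOperation and produces
// (trueValue, predictedValue) pairs for a PairwiseScore.
class PairwisePrepare(predictRatings: DataSet[(Int, Int)] => DataSet[(Int, Int, Double)])
  extends PrepareOperation[(Int, Int, Double), DataSet[(Double, Double)]] {

  def prepare(test: DataSet[(Int, Int, Double)]): DataSet[(Double, Double)] = {
    val predicted = predictRatings(test.map(t => (t._1, t._2)))
    test.join(predicted).where(0, 1).equalTo(0, 1) { (t, p) => (t._3, p._3) }
  }
}

// Ranking case: passes the test ratings through and creates top-k rankings
// for the test users, feeding a RankingScore.
class RankingPrepare(k: Int, predictRankings: (Int, DataSet[Int]) => DataSet[(Int, Int, Int)])
  extends PrepareOperation[(Int, Int, Double),
                           (DataSet[(Int, Int, Double)], DataSet[(Int, Int, Int)])] {

  def prepare(test: DataSet[(Int, Int, Double)]) = {
    val users = test.map(_._1).distinct()
    (test, predictRankings(k, users))
  }
}
{code}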

We have not gone further with the implementation, documentation, and code cleanup, as we need feedback regarding the API decisions. Are we on the right path? What do you think about our solution? How acceptable is it?

The sketch code can be found on this branch:

> Implementing ranking evaluation scores for recommender systems
> --------------------------------------------------------------
>                 Key: FLINK-4713
>                 URL: https://issues.apache.org/jira/browse/FLINK-4713
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Domokos Miklós Kelen
>            Assignee: Gábor Hermann
> Follow-up work to [4712|https://issues.apache.org/jira/browse/FLINK-4712] includes implementing
ranking recommendation evaluation metrics (such as precision@k, recall@k, ndcg@k), [similar
to Spark's implementations|https://spark.apache.org/docs/1.5.0/mllib-evaluation-metrics.html#ranking-systems].
It would be beneficial if we were able to design the API such that it could be included in
the proposed evaluation framework (see [2157|https://issues.apache.org/jira/browse/FLINK-2157]).
> In its current form, this would mean generalizing the PredictionType type parameter of the Score class to allow for {{Array[Int]}} or {{Array[(Int, Double)]}}, and outputting the recommendations in the form {{DataSet[(Int, Array[Int])]}} or {{DataSet[(Int, Array[(Int, Double)])]}}, meaning (user, array of items), possibly including the predicted scores as well.
> However, calculating for example nDCG for a given user u requires us to be able to access
all of the (u, item, relevance) records in the test dataset, which means we would need to
put this information in the second element of the {{DataSet[(PredictionType, PredictionType)]}}
input of the scorer function as PredictionType={{Array[(Int, Double)]}}. This is problematic,
as this Array could be arbitrarily long.
> Another option is to further rework the proposed evaluation framework to allow us to
implement this properly, with inputs in the form of {{recommendations : DataSet[(Int,Int,Int)]}}
(user, item, rank) and {{test : DataSet[(Int,Int,Double)]}} (user, item, relevance). This way,
the scores could be implemented such that they can be calculated in a distributed way.
> The third option is to implement the scorer functions outside the evaluation framework.
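
For reference, here is a minimal, hypothetical sketch of how nDCG@k could be computed in a distributed way from exactly these shapes, i.e. {{(user, item, rank)}} recommendations and {{(user, item, relevance)}} test records. All names are illustrative; this is not the proposed implementation.

{code:scala}
import org.apache.flink.api.scala._

object RankingScoresSketch {
  // Average nDCG@k over users, computed per user via a coGroup on the user id.
  def ndcgAtK(
      recommendations: DataSet[(Int, Int, Int)],  // (user, item, rank), rank starts at 1
      test: DataSet[(Int, Int, Double)],          // (user, item, relevance)
      k: Int): DataSet[Double] = {

    def log2(x: Double): Double = math.log(x) / math.log(2)

    val perUser = recommendations
      .filter(_._3 <= k)
      .coGroup(test).where(0).equalTo(0) { (recs, rels) =>
        val relevance = rels.map(r => (r._2, r._3)).toMap  // item -> relevance
        // DCG of the predicted top-k ranking
        val dcg = recs.map { case (_, item, rank) =>
          relevance.getOrElse(item, 0.0) / log2(rank + 1)
        }.sum
        // ideal DCG: the k most relevant test items in the best possible order
        val idcg = relevance.values.toSeq.sortBy(-_).take(k).zipWithIndex.map {
          case (rel, idx) => rel / log2(idx + 2)
        }.sum
        if (idcg == 0.0) 0.0 else dcg / idcg
      }

    // average the per-user scores
    perUser.map(x => (x, 1)).sum(0).andSum(1).map(t => t._1 / t._2)
  }
}
{code}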

This message was sent by Atlassian JIRA
