mahout-user mailing list archives

From Sean Owen <sro...@gmail.com>
Subject Re: Problems with Mahout's RecommenderIRStatsEvaluator
Date Sat, 16 Feb 2013 19:37:55 GMT
Yes. But: the test sample is small. Using 40% of your data to test is
probably too much.

My point is that it may be the least-bad thing to do. What test are you
proposing instead, and why is it coherent with what you're testing?




On Sat, Feb 16, 2013 at 8:26 PM, Ahmet Ylmaz <ahmetyilmazefendi@yahoo.com> wrote:

> But modeling a user only by his/her low ratings can be problematic since
> people generally are more precise (I believe) in their high ratings.
> Another problem is that recommender algorithms in general first mean
> normalize the ratings for each user. Suppose that we have the following
> ratings of 3 people (A, B, and C) on 5 items.
>
> A's ratings: 1 2 3 4 5
> B's ratings: 1 3 5 2 4
> C's ratings: 1 2 3 4 5
>
>
> Suppose that A is the test user. Now if we put only the low ratings of A
> (1, 2, and 3) into the training set and mean normalize the ratings then A
> will be
> more similar to B than C, which is not true.
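[The effect described above is easy to check numerically. The sketch below is plain Python, not Mahout's similarity code: it assumes each user's ratings are mean-centered over whatever ratings the trainer sees for that user, and that users are compared by cosine similarity over the items they co-rated with A.]

```python
# Sketch (not Mahout code): verify the claim above numerically.
# Items are indexed 0-4; only A's low ratings (1, 2, 3) on items
# 0-2 are visible to the trainer.
import math

def mean_center(ratings):
    """Subtract the user's mean over the ratings the trainer sees."""
    mu = sum(ratings.values()) / len(ratings)
    return {i: r - mu for i, r in ratings.items()}

def cosine_on_overlap(u, v):
    """Cosine similarity restricted to co-rated items."""
    common = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in common)
    nu = math.sqrt(sum(u[i] ** 2 for i in common))
    nv = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (nu * nv)

a = mean_center({0: 1, 1: 2, 2: 3})                  # A's low ratings only
b = mean_center({0: 1, 1: 3, 2: 5, 3: 2, 4: 4})      # B's full profile
c = mean_center({0: 1, 1: 2, 2: 3, 3: 4, 4: 5})      # C's full profile

sim_ab = cosine_on_overlap(a, b)
sim_ac = cosine_on_overlap(a, c)
print(sim_ab, sim_ac)  # A ends up more similar to B than to C
```

[Even though C's full profile is identical to A's, hiding A's high ratings shifts A's mean down and makes B the closer neighbor.]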
>
>
>
>
> ________________________________
>  From: Sean Owen <srowen@gmail.com>
> To: Mahout User List <user@mahout.apache.org>; Ahmet Ylmaz <
> ahmetyilmazefendi@yahoo.com>
> Sent: Saturday, February 16, 2013 8:41 PM
> Subject: Re: Problems with Mahout's RecommenderIRStatsEvaluator
>
> No, this is not a problem.
>
> Yes it builds a model for each user, which takes a long time. It's
> accurate, but time-consuming. It's meant for small data. You could rewrite
> your own test to hold out data for all test users at once. That's what I
> did when I rewrote a lot of this just because it was more useful to have
> larger tests.
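[A one-pass split of the kind described can be sketched as follows. This is plain Python under assumed data shapes (dicts of user -> item -> rating), not Mahout's DataModel API; the threshold rule mirrors the evaluator's: ratings at or above the relevance threshold are held out, the rest stay in training.]

```python
# Sketch (assumed data shapes, not Mahout's API): hold out the
# highly rated items of every test user at once, so one model can
# be trained and then evaluated against all test users.
def split_once(ratings_by_user, test_users, threshold):
    train, held_out = {}, {}
    for user, ratings in ratings_by_user.items():
        if user in test_users:
            # Test users keep only their low ratings for training...
            train[user] = {i: r for i, r in ratings.items() if r < threshold}
            # ...their high ratings become the relevant held-out set.
            held_out[user] = {i for i, r in ratings.items() if r >= threshold}
        else:
            # Non-test users contribute all of their ratings.
            train[user] = dict(ratings)
    return train, held_out

train, held = split_once(
    {"A": {"i1": 5, "i2": 2}, "B": {"i1": 4, "i3": 1}},
    test_users={"A"}, threshold=4.0)
print(train["A"], held["A"])  # {'i2': 2} {'i1'}
```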
>
> There are several ways to choose the test data. One common way is by time,
> but there is no time information here by default. The problem is that, for
> example, recent ratings may be low -- or at least not high ratings. But the
> evaluation is of course asking the recommender for items that are predicted
> to be highly rated. Random selection has the same problem. Choosing by
> rating at least makes the test coherent.
>
> It does bias the training set, but the test set is supposed to be small.
>
> There is no way to actually know, a priori, what the top recommendations
> are. You have no information to evaluate most recommendations. This makes a
> precision/recall test fairly uninformative in practice. Still, it's better
> than nothing and commonly understood.
>
> While precision/recall won't be high on tests like this for the reasons
> above, I don't get values this low for the MovieLens data with any normal
> algorithm. You might, though, if you choose an algorithm or parameters
> that don't work well.
>
>
>
>
> On Sat, Feb 16, 2013 at 7:30 PM, Ahmet Ylmaz <ahmetyilmazefendi@yahoo.com> wrote:
>
> > Hi,
> >
> > I have looked at the internals of Mahout's RecommenderIRStatsEvaluator
> > code. I think that there are two important problems here.
> >
> > According to my understanding the experimental protocol used in this code
> > is something like this:
> >
> > It takes away a certain percentage of users as test users.
> > For each test user it builds a training set consisting of the ratings
> > given by all other users, plus the ratings of the test user which are
> > below the relevanceThreshold.
> > It then builds a model, makes a recommendation to the test user, and
> > finds the intersection between this recommendation list and the items
> > rated above the relevanceThreshold by the test user.
> > It then calculates the precision and recall in the usual way.
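[The final step of the protocol described above can be made concrete with a small sketch. This is plain Python with hypothetical names, not Mahout's IRStatistics API: it scores one top-n recommendation list against the held-out items rated above the relevance threshold.]

```python
# Sketch (hypothetical helper, not Mahout's API): precision/recall
# of a top-n recommendation list against the held-out relevant items.
def precision_recall_at_n(recommended, relevant, n):
    top_n = set(recommended[:n])        # the n items actually shown
    hits = top_n & set(relevant)        # held-out relevant items recovered
    precision = len(hits) / n
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 2 of the top 5 recommendations appear among the 4
# held-out relevant items.
p, r = precision_recall_at_n(["i3", "i7", "i1", "i9", "i4"],
                             {"i7", "i4", "i2", "i8"}, n=5)
print(p, r)  # 0.4 0.5
```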
> >
> > Problems:
> > 1. (mild) It builds a model for every test user which can take a lot of
> > time.
> >
> > 2. (severe) Only the ratings (of the test user) which are below the
> > relevanceThreshold are put into the training set. This means that the
> > algorithm only knows the preferences of the test user for items s/he
> > doesn't like. This is not a good representation of user ratings.
> >
> > Moreover, when I ran this evaluator on the MovieLens 1M data, the
> > precision and recall turned out to be, respectively,
> >
> > 0.011534185658699288
> > 0.007905982905982885
> >
> > and the run took about 13 minutes on my Intel Core i3. (I used
> > user-based recommendation with k=2.)
> >
> >
> > Although I know that it is not OK to judge the performance of a
> > recommendation algorithm by absolute precision and recall values,
> > these numbers still seem too low to me, which might be the result
> > of the second problem I mentioned above.
> >
> > Am I missing something?
> >
> > Thanks
> > Ahmet
> >
>
