mahout-user mailing list archives

From Sean Owen <>
Subject Re: Generic Recommender algorithm questions (using Mahout 0.4)
Date Wed, 06 Jul 2011 21:37:00 GMT
On Wed, Jul 6, 2011 at 10:02 PM, Carlos Seminario <> wrote:
> Although this is certainly a sound approach, other approaches have been
> suggested in the literature as cited in
> Can you please provide some insight as to why you selected the above
> prediction calculation approach for Mahout?

This is a simple weighted average, and is the simplest and most canonical
thing to do -- it's what the literature suggests. I can imagine several
other things you can do here, and you're welcome to modify the code to do
them. The framework can't implement all thousand possible things you'd do at
every point so it usually provides the basic, simple pieces and invites you
to extend or modify as you like.
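
A minimal sketch of the kind of similarity-weighted average described above,
written in plain Java rather than against Mahout's own classes. The class and
method names are hypothetical, and returning NaN when the weights sum to zero
is an assumption of this sketch, not a statement about what Mahout does.

```java
/** Hypothetical illustration of a similarity-weighted average; not Mahout code. */
final class WeightedAverageSketch {

  /**
   * Estimates a preference as sum(sim_i * pref_i) / sum(sim_i), where each
   * neighbor i contributes its preference weighted by its similarity.
   */
  static double estimate(double[] similarities, double[] preferences) {
    if (similarities.length != preferences.length) {
      throw new IllegalArgumentException("arrays must have the same length");
    }
    double weightedSum = 0.0;
    double totalWeight = 0.0;
    for (int i = 0; i < similarities.length; i++) {
      weightedSum += similarities[i] * preferences[i];
      totalWeight += similarities[i];
    }
    // Assumption of this sketch: no usable neighbors means no estimate.
    return totalWeight == 0.0 ? Double.NaN : weightedSum / totalWeight;
  }

  public static void main(String[] args) {
    double[] similarities = {0.5, 0.25, 0.25};
    double[] preferences = {4.0, 2.0, 5.0};
    System.out.println(estimate(similarities, preferences)); // prints 3.75
  }
}
```

Swapping in a different prediction formula would mean replacing only the body
of a method like estimate(); the surrounding neighborhood logic stays the same.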

I have written these implementations to reflect the "canonical" and basic
way of doing things, not my inventions or ideas. I'm implementing standard
ideas, not my own.

But I do have plenty of other ideas for you if you like. For example:

In this formulation, the estimated preference value used to rank
recommendations is the mean of all the independent predictions. That's quite
sensible: I think the implicit assumption is that these predictions have
some normal distribution whose mean is the "real" preference for that item.
So the sample mean is as good an estimate as any of that real preference.

One problem is that this takes no account of your certainty about how close
the sample mean and real mean are. For instance, the mean of 100 predictions
is probably more reliable than the mean of 1, right? With more predictions,
the population mean is far more likely to be close to the sample mean.

You could use this idea directly by ranking by sample mean minus sample
standard deviation, instead of just sample mean. That's not an estimate of
the actual preference, but a sort of lower bound: a value that the real
preference is probably larger than.
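
As a rough sketch of that ranking idea in plain Java (hypothetical names, not
part of Mahout): score each item by the sample mean of its predictions minus
their sample standard deviation. Using the n - 1 denominator, and returning
NaN for fewer than two predictions, are assumptions of this sketch.

```java
/** Hypothetical illustration of ranking by mean minus standard deviation. */
final class LowerBoundScoreSketch {

  static double mean(double[] values) {
    double sum = 0.0;
    for (double v : values) {
      sum += v;
    }
    return sum / values.length;
  }

  /** Sample standard deviation with the n - 1 denominator. */
  static double sampleStdDev(double[] values) {
    double m = mean(values);
    double sumOfSquares = 0.0;
    for (double v : values) {
      double d = v - m;
      sumOfSquares += d * d;
    }
    return Math.sqrt(sumOfSquares / (values.length - 1));
  }

  /** Ranking score: sample mean minus sample standard deviation. */
  static double score(double[] predictions) {
    // Assumption of this sketch: the spread of fewer than two predictions
    // is undefined, so no score is produced.
    if (predictions.length < 2) {
      return Double.NaN;
    }
    return mean(predictions) - sampleStdDev(predictions);
  }

  public static void main(String[] args) {
    double[] steady = {4.0, 4.0, 4.0, 4.0}; // mean 4.0, no spread
    double[] noisy = {5.0, 3.0, 5.0, 3.0};  // mean 4.0, wide spread
    System.out.println(score(steady) > score(noisy)); // prints true
  }
}
```

Both items have the same mean prediction, but the one whose predictions agree
with each other ranks higher.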

> I also noticed that Mahout has implemented the following
> PearsonCorrelationSimilarity weighting when the WEIGHTED parameter is used
> in the similarity constructor:
> Would you please provide some insight as to why you decided to use this
> weighting approach?

This is somewhat made-up. There is no strong mathematical justification for
it. I can explain the intuition behind why it is sensible, but I think the
answer is just that it is a crude adjustment to a similarity metric you
probably won't use anyway, but one that is so well-known that it needs to be
supported.

> It appears that Mahout calculates similarities between users to determine
> the neighborhood and then again during the prediction calculation. When
> running an evaluator (e.g., DifferenceRecommenderEvaluator), I can see that
> the user similarities are computed repeatedly for each user. Is there a
> reason why it was implemented this way? (“time vs space” tradeoff?)

UserSimilarity implementations always compute user-user similarity. You can
wrap one in CachingUserSimilarity if you want it cached. These are separate
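
To illustrate the general wrap-and-cache idea, here is a minimal sketch in
plain Java. It is not Mahout's CachingUserSimilarity: the class, its methods
and the call counter are hypothetical, and it assumes the similarity is
symmetric, the cache may grow without bound, and only one thread uses it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToDoubleBiFunction;

/** Hypothetical caching wrapper around a pairwise user similarity function. */
final class CachingSimilaritySketch {

  private final ToDoubleBiFunction<Long, Long> delegate;
  private final Map<String, Double> cache = new HashMap<>();
  private int delegateCalls = 0;

  CachingSimilaritySketch(ToDoubleBiFunction<Long, Long> delegate) {
    this.delegate = delegate;
  }

  double similarity(long userA, long userB) {
    // Assumption: similarity(a, b) == similarity(b, a), so order the IDs
    // and let both directions share one cache entry.
    long low = Math.min(userA, userB);
    long high = Math.max(userA, userB);
    String key = low + ":" + high;
    Double cached = cache.get(key);
    if (cached == null) {
      delegateCalls++;
      cached = delegate.applyAsDouble(low, high);
      cache.put(key, cached);
    }
    return cached;
  }

  /** Number of times the wrapped similarity was actually computed. */
  int delegateCalls() {
    return delegateCalls;
  }

  public static void main(String[] args) {
    // Toy similarity used only for the demonstration.
    CachingSimilaritySketch cachedSimilarity =
        new CachingSimilaritySketch((a, b) -> 1.0 / (1.0 + Math.abs(a - b)));
    cachedSimilarity.similarity(1L, 3L);
    cachedSimilarity.similarity(3L, 1L);
    System.out.println(cachedSimilarity.delegateCalls()); // prints 1
  }
}
```

The trade is memory for time: repeated lookups for the same pair of users,
as happens when an evaluator runs over every user, hit the map instead of
recomputing.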

> Can you provide some insight as to why you decided to use this approach?
> Were there any other approaches you considered but rejected, and if so, why
> did you reject them?

Same as #1, this is just a simple weighted average.
