mahout-user mailing list archives

From Ted Dunning <>
Subject Re: [Taste] Sanity Check and Questions
Date Thu, 18 Jun 2009 20:43:54 GMT

The data you described is pretty simple and should produce good results at
all levels of overlap.  That it does not is definitely a problem.   In fact,
I would recommend making the data harder to deal with by giving non-Lincoln
items highly variable popularities and then making the groundlings rate
items according to their popularity.  With data like that, any neighbor
group that includes non-Lincoln fans will show an apparent pattern of
liking popular items.  The correct inference should,
however, be that any neighbor group that has a large number of Lincoln fans
seems to like popular items less than expected.

For problems like this, I have had good luck with measures that are
robust in the face of noise (aka small counts) and in the face of large
external trends (aka the top-40 problem).  The simplest one that I know of
is the generalized multinomial log-likelihood ratio that
you hear me nattering about so often.  LSI does a decent job of dealing
with the top-40, but has trouble with small counts.  LDA and related
probabilistic methods should work somewhat better than log-likelihood
ratio, but are considerably more complex to implement.

The key here is to compare counts within the local neighborhood to counts
outside the neighborhood.  Things that are significantly different about the
neighborhood relative to the rest of the world are candidates for
recommendation.  Things to avoid when looking for interesting differences
include:

- correlation measures such as Pearson's R (based on a normal distribution
approximation and unscaled, and thus suffer from both small-count and top-40
problems)

- anomaly measures based simply on frequency ratios (very sensitive to small
count problems, don't account for top-40 at all)
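
As a rough illustration of comparing counts inside the neighborhood to
counts outside it, here is a minimal sketch of the log-likelihood ratio on a
2x2 table of counts.  The function names and the table layout (k11 = times
the item occurs inside the neighborhood, k12 = other items inside, k21 = the
item outside, k22 = other items outside) are assumptions made for the
example, not code from Taste.

```python
import math


def x_log_x(x):
    # x * log(x), with the usual convention that 0 * log(0) = 0
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized Shannon entropy: N*log(N) - sum(k*log(k))
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    # Log-likelihood ratio (G^2) for a 2x2 contingency table.
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


if __name__ == "__main__":
    # Same proportions inside and outside the neighborhood: no signal.
    print(llr(10, 10, 10, 10))
    # Identical ratios, but very different amounts of evidence.
    print(llr(1, 0, 0, 1), llr(100, 0, 0, 100))
```

An item that is equally common inside and outside the neighborhood scores
zero no matter how popular it is, and the same frequency ratio scores far
lower when it rests on one or two observations.  The score is two-sided, so
a real recommender would also check that the item is over-represented in
the neighborhood rather than under-represented.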

What I would recommend for a nearest neighbor approach is to continue with
the current neighbor retrieval, but switch to a log-likelihood ratio for
generating recommendations.

What I would recommend for a production system would be to scrap the nearest
neighbor approach entirely and go to a cooccurrence-matrix-based approach.
This costs much less to compute at recommendation time and is very robust
against both small counts and top-40 issues.
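
One way such a cooccurrence approach might look, as a sketch only: count
how often pairs of items appear in the same user history, keep the pairs
whose log-likelihood ratio score is high, and recommend by looking up the
kept pairs for the items a user already has.  The names `indicators` and
`recommend`, the `threshold` value, and the toy data are all hypothetical
and chosen just for this example.

```python
import math
from collections import Counter
from itertools import combinations


def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))


def indicators(histories, threshold=3.0):
    """histories maps user -> set of items; threshold is an assumed cutoff."""
    n_users = len(histories)
    item_count = Counter()
    pair_count = Counter()
    for items in histories.values():
        item_count.update(items)
        pair_count.update(combinations(sorted(items), 2))
    related = {}
    for (a, b), k11 in pair_count.items():
        k12 = item_count[a] - k11              # users with a but not b
        k21 = item_count[b] - k11              # users with b but not a
        k22 = n_users - k11 - k12 - k21        # users with neither
        if llr(k11, k12, k21, k22) >= threshold:
            related.setdefault(a, set()).add(b)
            related.setdefault(b, set()).add(a)
    return related


def recommend(history, related):
    # Score unseen items by how many of the user's items point at them.
    scores = Counter()
    for item in history:
        for other in related.get(item, ()):
            if other not in history:
                scores[other] += 1
    return [item for item, _ in scores.most_common()]


if __name__ == "__main__":
    # Toy data: 8 Lincoln fans and 32 groundlings; everyone has the top-40 item.
    histories = {f"fan{i}": {"lincoln1", "lincoln2", "top40"} for i in range(8)}
    histories.update({f"user{i}": {"top40"} for i in range(32)})
    related = indicators(histories)
    print(recommend({"lincoln1"}, related))
```

The expensive part (counting and scoring pairs) happens offline, and the
work at recommendation time is a few dictionary lookups.  In the toy data
the item that everybody has cooccurs with everything, yet it scores zero
and never becomes an indicator.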

On Thu, Jun 18, 2009 at 9:37 AM, Sean Owen <> wrote:

> Still seems a little funny. I am away from the code otherwise I would check
> - forget if I ever implemented weighting in the standard correlation
> similarity metrics. A pure correlation does not account for whether you
> overlapped in 10 or 1000 items. This sort of weighting exists elsewhere but
> I forget about here. It might help.

Ted Dunning, CTO
