mahout-user mailing list archives

From Sean Owen <sro...@gmail.com>
Subject Re: [Taste] Sanity Check and Questions
Date Fri, 19 Jun 2009 10:14:16 GMT
Following on from my last e-mail -- yes, I was not crazy and did implement a
basic weighting mechanism like the one I described, in
PearsonCorrelationSimilarity. You can select it in the constructor and
see what happens.

There is also a LogLikelihoodSimilarity like the one Ted mentions. I only
implemented ItemSimilarity, but UserSimilarity could be added with a
bit more work -- take a look.
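
For reference, a minimal sketch of how these might be wired together with
the Taste classes. The package names, constructor signatures, and the
"ratings.csv" file below are assumed from the Taste API and are illustrative
only, not necessarily an exact match for the current code:

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.common.Weighting;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public final class WeightedPearsonExample {
      public static void main(String[] args) throws Exception {
        // userID,itemID,preference per line
        DataModel model = new FileDataModel(new File("ratings.csv"));
        // The Weighting argument selects the basic weighting mechanism mentioned above.
        UserSimilarity similarity =
            new PearsonCorrelationSimilarity(model, Weighting.WEIGHTED);
        UserNeighborhood neighborhood =
            new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender =
            new GenericUserBasedRecommender(model, neighborhood, similarity);
        List<RecommendedItem> recs = recommender.recommend(1L, 5);
        System.out.println(recs);
        // LogLikelihoodSimilarity would plug into an item-based recommender the
        // same way, since (as noted above) it currently implements ItemSimilarity.
      }
    }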

What is the co-occurrence approach, Ted?

On Thu, Jun 18, 2009 at 4:43 PM, Ted Dunning<ted.dunning@gmail.com> wrote:
> Grant,
>
> The data you described is pretty simple and should produce good results at
> all levels of overlap.  That it does not is definitely a problem.  In fact,
> I would recommend making the data harder to deal with by giving non-Lincoln
> items highly variable popularities and then making the groundlings rate
> items according to their popularity.  This will result in an apparent
> pattern where the inclusion of any number of non-Lincoln fans makes the
> neighborhood seem to like popular items.  The correct inference should,
> however, be that any neighbor group that has a large number of Lincoln fans
> seems to like popular items less than expected.
>
> For problems like this, I have had good luck using measures that are
> robust in the face of noise (aka small counts) and in the face of large
> external trends (aka the top-40 problem).  The simplest one that I know of
> is the generalized multinomial log-likelihood ratio
> <http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html> that
> you hear me nattering about so often.  LSI does a decent job of dealing
> with the top-40, but has trouble with small counts.  LDA and related
> probabilistic methods should work somewhat better than the log-likelihood
> ratio, but are considerably more complex to implement.
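
For concreteness, a small sketch of that statistic for a 2x2 table of counts,
following the formula in the linked blog post. The class and method names here
are illustrative, not anything taken from Taste:

    /** Log-likelihood ratio (G^2) for a 2x2 contingency table of counts:
        k11 = both events together, k12 = only the first, k21 = only the
        second, k22 = neither. */
    public final class LogLikelihoodSketch {

      static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
      }

      // Unnormalized entropy of the counts: N * Shannon entropy.
      static double entropy(long... counts) {
        long sum = 0;
        double sumXLogX = 0.0;
        for (long count : counts) {
          sum += count;
          sumXLogX += xLogX(count);
        }
        return xLogX(sum) - sumXLogX;
      }

      static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Guard against tiny negative values from floating-point round-off;
        // the value is 0 when the two events are exactly independent.
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
      }
    }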
>
> The key here is to compare counts within the local neighborhood to counts
> outside the neighborhood.  Things that are significantly different about the
> neighborhood relative to the rest of the world are candidates for
> recommendation (a sketch of this comparison follows the list below).  Things
> to avoid when looking for interesting differences include:
>
> - correlation measures such as Pearson's R (based on a normal-distribution
> approximation and unscaled, so it suffers from both small-count and top-40
> problems)
>
> - anomaly measures based simply on frequency ratios (very sensitive to
> small-count problems, and they don't account for the top-40 at all)
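
As one reading of that in-neighborhood vs. out-of-neighborhood comparison,
each candidate item could be scored roughly like this. The method and its
count arguments are hypothetical, and it reuses the logLikelihoodRatio
function sketched above:

    /** Hypothetical scoring of one candidate item against a user neighborhood. */
    static double neighborhoodScore(long itemInNeighborhood,      // neighborhood users who have the item
                                    long neighborhoodSize,        // users in the neighborhood
                                    long itemOutsideNeighborhood, // outside users who have the item
                                    long outsideSize) {           // users outside the neighborhood
      long k11 = itemInNeighborhood;
      long k12 = neighborhoodSize - itemInNeighborhood;
      long k21 = itemOutsideNeighborhood;
      long k22 = outsideSize - itemOutsideNeighborhood;
      // A high value means the item is unusually common (or unusually rare) in the
      // neighborhood relative to the rest of the world; check the direction of the
      // difference before recommending.
      return LogLikelihoodSketch.logLikelihoodRatio(k11, k12, k21, k22);
    }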
>
> What I would recommend for a nearest neighbor approach is to continue with
> the current neighbor retrieval, but switch to a log-likelihood ratio for
> generating recommendations.
>
> What I would recommend for a production system would be to scrap the nearest
> neighbor approach entirely and go to a co-occurrence matrix based approach.
> This costs much less to compute at recommendation time and is very robust
> against both small counts and top-40 issues.
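
The thread does not spell out the co-occurrence approach (hence the question
above), but one common formulation is to count, for every pair of items, how
many users interacted with both, and then score unseen items for a user by
summing their co-occurrence with the items that user already has. A rough
sketch under those assumptions, not code from Taste:

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    /** Hypothetical co-occurrence recommender sketch over binary user-item data. */
    public final class CooccurrenceSketch {

      /** For every pair of items, count how many users interacted with both. */
      static Map<Long, Map<Long, Integer>> cooccurrences(Iterable<Set<Long>> itemSetsByUser) {
        Map<Long, Map<Long, Integer>> counts = new HashMap<>();
        for (Set<Long> items : itemSetsByUser) {
          for (Long a : items) {
            for (Long b : items) {
              if (!a.equals(b)) {
                counts.computeIfAbsent(a, k -> new HashMap<>()).merge(b, 1, Integer::sum);
              }
            }
          }
        }
        return counts;
      }

      /** Score items the user hasn't seen by summing co-occurrence with items they have. */
      static Map<Long, Integer> score(Set<Long> userItems,
                                      Map<Long, Map<Long, Integer>> counts) {
        Map<Long, Integer> scores = new HashMap<>();
        for (Long item : userItems) {
          Map<Long, Integer> row = counts.getOrDefault(item, Collections.emptyMap());
          for (Map.Entry<Long, Integer> e : row.entrySet()) {
            if (!userItems.contains(e.getKey())) {
              scores.merge(e.getKey(), e.getValue(), Integer::sum);
            }
          }
        }
        return scores; // sort by value, descending, for top-N recommendations
      }
    }

Raw co-occurrence counts still favor globally popular items, so filtering or
down-weighting pairs with a log-likelihood ratio test like the one sketched
earlier is one way to address the top-40 effect described above.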
>
