mahout-user mailing list archives

From Sean Owen <sro...@gmail.com>
Subject Re: Re: Re: got Error: GC overhead limit exceeded when generateproductsimilariy
Date Mon, 09 Nov 2009 12:57:21 GMT
Yes, I agree that keeping all pairs is quite expensive, unless your
data set is relatively small (like tens of thousands of items). If
you're not running out of memory, OK, you can get away with it for
now.

But yes, many of the similarities will not contain much information
and don't add much value -- the question is, which ones?

For Pearson correlation-based similarity, it's not just a matter of
keeping the pairs with the largest and smallest similarity scores --
those nearest 1 or -1. A similarity of 0 could still be very useful
information. I think you would actually want to keep an item-item pair
based on how many users expressed a preference for both items: the
more users, the more important it is to keep that pair.
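As a minimal sketch of that pruning criterion (not the actual Taste code; the item-to-users map is an assumed input structure), counting the users who rated both items looks like:

```java
import java.util.Map;
import java.util.Set;

public class CooccurrenceCount {

    // Count the users who expressed a preference for both items.
    // itemUsers maps an item ID to the set of user IDs that rated it.
    public static int countCommonUsers(Map<Long, Set<Long>> itemUsers,
                                       long itemA, long itemB) {
        Set<Long> usersA = itemUsers.getOrDefault(itemA, Set.of());
        Set<Long> usersB = itemUsers.getOrDefault(itemB, Set.of());
        // Iterate over the smaller set; membership tests hit the larger one.
        if (usersA.size() > usersB.size()) {
            Set<Long> tmp = usersA;
            usersA = usersB;
            usersB = tmp;
        }
        int count = 0;
        for (long u : usersA) {
            if (usersB.contains(u)) {
                count++;
            }
        }
        return count;
    }
}
```

Pairs with a high common-user count would then be retained; pairs co-rated by only one or two users carry little evidence either way.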

If you'd like an example of efficiently looking through a large list
of things, and keeping only the "top n" of them, see the TopItems
class. You don't want to generate all pairs at once, then throw some
away -- that would still run you out of memory.
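The idea behind TopItems can be sketched with a bounded min-heap (this is an illustrative equivalent, not the TopItems source itself): stream the candidates through, and only ever hold n of them in memory.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class TopN {

    // Keep only the n largest scores while streaming; memory stays O(n),
    // never O(total candidates). Assumes n >= 1.
    public static List<Double> topN(Iterable<Double> scores, int n) {
        PriorityQueue<Double> heap = new PriorityQueue<>(n); // min-heap
        for (double s : scores) {
            if (heap.size() < n) {
                heap.add(s);
            } else if (s > heap.peek()) {
                // Candidate beats the current worst of the top n; swap it in.
                heap.poll();
                heap.add(s);
            }
        }
        List<Double> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder());
        return result;
    }
}
```

The key point is that candidates are examined one at a time and discarded immediately unless they beat the current n-th best, so the full pair list never materializes.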

Ted will say, and again I agree, that Pearson is not usually the
best similarity metric, though it is widely mentioned in collaborative
filtering examples and literature.

What Ted quotes below is implemented in the framework as
LogLikelihoodSimilarity. For that, I believe it *is* the pairs with
the largest resulting similarity score that you do want to keep. Or at
least it is more reasonable. Ted, maybe you can check my thinking on
that.
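For reference, the log-likelihood ratio test from Ted's post reduces to an entropy computation over the 2x2 contingency table of co-occurrence counts. A sketch (following the standard G-test formulation, not copied from the Mahout source):

```java
public class LLR {

    private static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized Shannon entropy of a set of counts (in nats, scaled by N).
    private static double entropy(long... counts) {
        long sum = 0;
        double sumXLogX = 0.0;
        for (long c : counts) {
            sum += c;
            sumXLogX += xLogX(c);
        }
        return xLogX(sum) - sumXLogX;
    }

    // k11: users who rated both items; k12: rated A but not B;
    // k21: rated B but not A; k22: rated neither.
    public static double logLikelihoodRatio(long k11, long k12,
                                            long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
    }
}
```

A score near zero means the two items co-occur about as often as independence predicts; large scores mark the surprising, informative pairs -- which is why keeping the pairs with the largest scores makes sense here.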

Sean

On Mon, Nov 9, 2009 at 7:09 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
> Close.
>
> See the link below for one approach to finding the most important ones.  I
> believe that Sean has added something like this to Taste/Mahout.
>
> http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
