mahout-dev mailing list archives

From Sean Owen <sro...@gmail.com>
Subject Re: Offline computation of item-item similarities
Date Thu, 29 Oct 2009 11:19:43 GMT
On Wed, Oct 28, 2009 at 3:06 PM, Gökhan Çapan <gkhncpn@gmail.com> wrote:
> similarities at once. For example, loglikelihood similarity does something
> with 4 quantities of two items:
> item1item2   : # of users who rated both item1 and item2
> !item1item2  : # of users who rated item2 but not item1
> item1!item2  : # of users who rated item1 but not item2
> !item1!item2 : # of users who didn't rate item1 or item2
> It is easy to find these quantities with mapreduce. When I finish
> implementing it, I will share it with a patch.

Yes, you can easily compute these offline, and then use them in an
alternate implementation of LogLikelihoodSimilarity to produce fast
results. The only possible issue I see is loading all these values
into memory, since the number of pairs grows as the square of the
number of items.
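For reference, a rough in-memory sketch of both the counting and the raw
log-likelihood ratio score (the MapReduce version would emit the same four
counts per pair; the class and helper names here are invented for
illustration, not the framework's implementation):

  import java.util.Collection;
  import java.util.Set;

  final class LLRSketch {

    // k11 = both items, k12 = item1 only, k21 = item2 only, k22 = neither
    static double llr(long item1, long item2, Collection<Set<Long>> itemsPerUser) {
      long k11 = 0, k12 = 0, k21 = 0, k22 = 0;
      for (Set<Long> items : itemsPerUser) {
        boolean has1 = items.contains(item1);
        boolean has2 = items.contains(item2);
        if (has1 && has2) { k11++; }
        else if (has1)    { k12++; }
        else if (has2)    { k21++; }
        else              { k22++; }
      }
      // G^2 = 2 * (row entropy + column entropy - matrix entropy)
      double rowEntropy = entropy(k11 + k12, k21 + k22);
      double columnEntropy = entropy(k11 + k21, k12 + k22);
      double matrixEntropy = entropy(k11, k12, k21, k22);
      return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
    }

    private static double entropy(long... counts) {
      long sum = 0;
      double total = 0.0;
      for (long count : counts) {
        total += xLogX(count);
        sum += count;
      }
      return xLogX(sum) - total;
    }

    private static double xLogX(long x) {
      return x == 0 ? 0.0 : x * Math.log(x);
    }
  }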

You could store only the most popular pairs in memory and fall back to a
normal computation when a pair hasn't been precomputed. This amounts
to caching, and might be done just as easily by writing a caching
wrapper around LogLikelihoodSimilarity based on the "Cache" class,
which will do a nice job of limiting memory usage and evicting
unpopular entries.
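A minimal sketch of that caching idea, using a plain LinkedHashMap in LRU
mode instead of the framework's Cache class (the PairSimilarity interface
here is hypothetical, standing in for LogLikelihoodSimilarity):

  import java.util.LinkedHashMap;
  import java.util.Map;

  // Hypothetical stand-in for the similarity being wrapped
  interface PairSimilarity {
    double similarity(long item1, long item2);
  }

  // Memoizes pair scores and evicts the least-recently-used entries
  // once maxEntries is reached, bounding memory usage
  final class CachingSimilarity implements PairSimilarity {

    private final PairSimilarity delegate;
    private final Map<Long, Double> cache;

    CachingSimilarity(PairSimilarity delegate, final int maxEntries) {
      this.delegate = delegate;
      // accessOrder = true turns the map into an LRU cache
      this.cache = new LinkedHashMap<Long, Double>(maxEntries, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<Long, Double> eldest) {
          return size() > maxEntries;
        }
      };
    }

    @Override
    public double similarity(long item1, long item2) {
      // Order the pair so (a,b) and (b,a) share one entry;
      // the packed key assumes item IDs fit in 32 bits
      long lo = Math.min(item1, item2);
      long hi = Math.max(item1, item2);
      Long key = (lo << 32) | (hi & 0xFFFFFFFFL);
      Double cached = cache.get(key);
      if (cached == null) {
        cached = delegate.similarity(lo, hi);
        cache.put(key, cached);
      }
      return cached;
    }
  }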


> There is also another measure that we talked about in our discussion. Ted
> said the dot product can yield very good results, and that it can be weighted
> with inverse user frequency.
> I've implemented 3 versions of that approach to see how good the results
> are:

Agree, this is what PearsonCorrelationSimilarity basically does --
really it's also an implementation of the cosine-measure similarity,
since the data is 'centered' to a mean of zero. And I believe this is
what Ted is referring to by dot product. The framework also already
has a transformation for inverse user frequency.
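To make the centering point concrete, here's a toy standalone check (nothing
framework-specific): Pearson correlation is exactly the cosine of the two
vectors once each has had its mean subtracted.

  // cosine(x - mean(x), y - mean(y)) == Pearson correlation of x and y
  static double centeredCosine(double[] x, double[] y) {
    double meanX = 0.0, meanY = 0.0;
    for (int i = 0; i < x.length; i++) {
      meanX += x[i];
      meanY += y[i];
    }
    meanX /= x.length;
    meanY /= y.length;
    double dot = 0.0, normX = 0.0, normY = 0.0;
    for (int i = 0; i < x.length; i++) {
      double a = x[i] - meanX;  // center to mean zero
      double b = y[i] - meanY;
      dot += a * b;
      normX += a * a;
      normY += b * b;
    }
    return dot / Math.sqrt(normX * normY);
  }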

But yeah, I think your point is about doing this at scale, all at once, with
a big matrix operation, via MapReduce, for all pairs of items. I agree
this is a useful thing. The results, when precomputed this way, make
the online similarity computation very fast. You could feed them into
any item-based recommender that way.
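For instance, once the pair scores are computed offline, you could load them
into GenericItemSimilarity and hand that to GenericItemBasedRecommender --
a sketch from memory, so treat the exact constructors as assumptions:

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
  import org.apache.mahout.cf.taste.impl.similarity.GenericItemSimilarity;
  import org.apache.mahout.cf.taste.model.DataModel;
  import org.apache.mahout.cf.taste.recommender.Recommender;

  // model is a DataModel built elsewhere; one ItemItemSimilarity entry
  // is added per (item1, item2, score) triple from the MapReduce output
  static Recommender fromPrecomputed(DataModel model) throws Exception {
    List<GenericItemSimilarity.ItemItemSimilarity> precomputed =
        new ArrayList<GenericItemSimilarity.ItemItemSimilarity>();
    precomputed.add(new GenericItemSimilarity.ItemItemSimilarity(123L, 456L, 0.87));
    // ... load the rest of the precomputed pairs here ...
    return new GenericItemBasedRecommender(model, new GenericItemSimilarity(precomputed));
  }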

Or you could rewrite the rest of the recommender process in MapReduce
and continue that way.

> First, I have worked on a 1-month user log of an e-commerce site. The data
> set is very sparse, but there are some very popular items that don't fit the
> general characteristics of the data.
>
> I've tried 3 versions of that computation:
> 1- iuf = log(N/(1+uf))  ; N is the total number of users here

Yep

> score(item1,item2) = cosine similarity of the item1 and item2 vectors whose
> elements are weighted with the iuf described above; the dot product of the
> two items is normalized by the Euclidean lengths of the items (i.e.
> score = dotproduct / (length(item1) * length(item2)))

PS yeah, this is the cosine measure similarity, and it also reduces to
Pearson when the data has mean zero.
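Spelled out, that first version looks something like this (a sketch: the
item vectors are hypothetical maps from user ID to rating, uf is taken as
the number of items a user has rated, and N is the total number of users,
per the formula above):

  import java.util.Map;

  // IUF-weighted cosine between two item vectors; each user dimension
  // is weighted by iuf(u) = log(N / (1 + uf(u)))
  static double iufWeightedCosine(Map<Long, Double> item1,
                                  Map<Long, Double> item2,
                                  Map<Long, Integer> ratingsPerUser,
                                  long numUsers) {
    double dot = 0.0, norm1 = 0.0, norm2 = 0.0;
    for (Map.Entry<Long, Double> e : item1.entrySet()) {
      double w = iuf(ratingsPerUser.get(e.getKey()), numUsers);
      double a = w * e.getValue();
      norm1 += a * a;
      Double other = item2.get(e.getKey());
      if (other != null) {
        dot += a * (w * other);  // only co-rated users contribute to the dot product
      }
    }
    for (Map.Entry<Long, Double> e : item2.entrySet()) {
      double b = iuf(ratingsPerUser.get(e.getKey()), numUsers) * e.getValue();
      norm2 += b * b;
    }
    return dot / (Math.sqrt(norm1) * Math.sqrt(norm2));
  }

  static double iuf(int uf, long n) {
    return Math.log((double) n / (1 + uf));
  }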


> 2- iuf is the same as in 1; the score is just the dot product

I think it's problematic to not normalize the dot product. I wouldn't
expect this to work well, but let's see your results below --


> 3- iuf = N/(1+uf), and the score is just the dot product

Valid, but I do think the log() is useful to keep here.


> As a result, I think it would be good if we found a way to smooth item
> vectors to make the 1st way yield better results, for example by filling in
> some missing data. And it would be good if we had a way to merge the 2nd and
> 3rd ways. I mean, another similarity measure that finds hidden (hard to find)
> similar items like the 2nd way, and gives results like the 3rd way.

"Filling missing data" -- yes this is what the "PreferenceInferrer"
does in the framework. I am not convinced it is very useful, but lets
you synthesize missing data.

I also suspect 1 is the way to go. It is definitely the conventional
approach and the one already implemented by the framework.
