On Wed, Oct 28, 2009 at 3:06 PM, Gökhan Çapan <gkhncpn@gmail.com> wrote:
> similarities at once. For example, log-likelihood similarity does something
> with 4 quantities of two items:
> item1item2 : # of users who rated both item1 and item2
> !item1item2 : # of users who rated item2 but not item1
> item1!item2 : # of users who rated item1 but not item2
> !item1!item2 : # of users who didn't rate item1 or item2
> It is easy to find these quantities with mapreduce. When I finish
> implementing it, I will share it with a patch.
Yes you can easily compute these offline, and then use them in an
alternate implementation of LogLikelihoodSimilarity to produce fast
results. The only possible issue I see is loading all these values
into memory, since their number grows as the square of the number of
items. You could store only the most popular pairs in memory and fall
back to a normal computation when a pair hasn't been precomputed. This
amounts to caching, and might be done just as easily by writing a
caching wrapper around LogLikelihoodSimilarity that uses the "Cache"
class, which will do a nice job of limiting memory usage and removing
unpopular entries.
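Just to sketch what I mean (this uses a plain LRU map in place of the actual
Cache class, and assumes an ItemSimilarity-style method taking long item IDs,
so don't read it as the real API):

import java.util.LinkedHashMap;
import java.util.Map;

public class CachingSimilarityWrapper {

  // hypothetical stand-in for LogLikelihoodSimilarity's similarity method
  public interface PairSimilarity {
    double itemSimilarity(long itemID1, long itemID2);
  }

  private final PairSimilarity delegate;
  private final Map<String, Double> lru;

  public CachingSimilarityWrapper(PairSimilarity delegate, final int maxEntries) {
    this.delegate = delegate;
    // access-ordered map: evicts the least-recently-used pair, so memory
    // stays bounded and unpopular pairs fall out first
    this.lru = new LinkedHashMap<String, Double>(maxEntries, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, Double> eldest) {
        return size() > maxEntries;
      }
    };
  }

  public double itemSimilarity(long itemID1, long itemID2) {
    // order the IDs so (a,b) and (b,a) share one cache entry
    String key = Math.min(itemID1, itemID2) + ":" + Math.max(itemID1, itemID2);
    Double cached = lru.get(key);
    if (cached != null) {
      return cached;
    }
    double value = delegate.itemSimilarity(itemID1, itemID2);
    lru.put(key, value);
    return value;
  }
}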
> There is also another measure that we talked about in our discussion. Ted
> said dot product can yield very good results and it can be weighted with
> inverse user frequency.
> I've implemented 3 versions of that approach to see how good the results
> are:
Agreed, this is basically what PearsonCorrelationSimilarity does.
Really it's also an implementation of the cosine-measure similarity,
since the data is 'centered' to a mean of zero. And I believe this is
what Ted is referring to by dot product. The framework also already
has a transformation for inverse user frequency.
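To make that concrete, a tiny standalone demo (plain Java with made-up
numbers, not the framework's actual implementation) prints the same value
twice:

public class CenteredCosineDemo {

  public static void main(String[] args) {
    double[] x = {5, 3, 4, 4};  // two item vectors over the same 4 users
    double[] y = {3, 1, 2, 3};
    System.out.println(pearson(x, y));
    System.out.println(cosine(center(x), center(y)));
  }

  static double mean(double[] v) {
    double sum = 0.0;
    for (double d : v) {
      sum += d;
    }
    return sum / v.length;
  }

  // subtract the mean from every element
  static double[] center(double[] v) {
    double m = mean(v);
    double[] c = new double[v.length];
    for (int i = 0; i < v.length; i++) {
      c[i] = v[i] - m;
    }
    return c;
  }

  // dot product over the product of Euclidean lengths
  static double cosine(double[] a, double[] b) {
    double dot = 0.0;
    double na = 0.0;
    double nb = 0.0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
  }

  // covariance over the product of standard deviations
  static double pearson(double[] a, double[] b) {
    double ma = mean(a);
    double mb = mean(b);
    double cov = 0.0;
    double va = 0.0;
    double vb = 0.0;
    for (int i = 0; i < a.length; i++) {
      cov += (a[i] - ma) * (b[i] - mb);
      va += (a[i] - ma) * (a[i] - ma);
      vb += (b[i] - mb) * (b[i] - mb);
    }
    return cov / (Math.sqrt(va) * Math.sqrt(vb));
  }
}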
But yeah, I think your point is doing this at scale, all at once, with
a big matrix operation, via MapReduce, for all pairs of items. I agree
this is a useful thing. The results, when precomputed this way, make
the online similarity computation very fast. You could feed them into
any item-based recommender that way.
Or you could rewrite the rest of the recommender process in MapReduce
and continue that way.
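For the precomputation itself, the co-rated counts are a natural fit for one
MapReduce pass. A rough sketch (not Gökhan's patch; it assumes input lines of
the form "userID<TAB>item1,item2,...", leaves out the job driver, and the
other three quantities fall out of per-item counts plus the total user count):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CooccurrenceCounts {

  // emits one count per co-rated item pair per user
  public static class PairMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] items = line.toString().split("\t")[1].split(",");
      for (int i = 0; i < items.length; i++) {
        for (int j = i + 1; j < items.length; j++) {
          String a = items[i];
          String b = items[j];
          if (a.compareTo(b) > 0) {  // order the pair consistently
            String tmp = a;
            a = b;
            b = tmp;
          }
          context.write(new Text(a + ',' + b), ONE);
        }
      }
    }
  }

  // sums the counts: the result is # of users who rated both items in the pair
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text pair, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : counts) {
        sum += count.get();
      }
      context.write(pair, new IntWritable(sum));
    }
  }
}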
> First, I have worked on a 1-month user log of an e-commerce site. The data
> set is very sparse, but there are some very popular items that don't fit the
> general characteristics of the data.
>
> I've tried 3 versions of that computation:
> 1. iuf = log(N / (1 + uf)); N is the total number of users here
Yep
> score(item1, item2) = cosine similarity of the item1 and item2 vectors whose
> elements are weighted with the iuf described above, i.e. the dot product of
> the two item vectors is normalized by the Euclidean lengths of the items
> (score = dotproduct / (length(item1) * length(item2)))
PS: yeah, this is the cosine-measure similarity, and it also reduces
to Pearson when the data has mean zero.
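For reference, here's roughly how I read option 1, as plain Java rather than
anything in the framework (item vectors are maps of userID -> rating, and I'm
assuming uf means how many items that user has rated):

import java.util.Map;

public class IufCosine {

  // iuf = log(N / (1 + uf)); N = total number of users, uf = that user's count
  static double iuf(int uf, int numUsers) {
    return Math.log((double) numUsers / (1.0 + uf));
  }

  // cosine of the two iuf-weighted item vectors:
  // score = dotproduct / (length(item1) * length(item2))
  static double similarity(Map<Long, Double> item1,
                           Map<Long, Double> item2,
                           Map<Long, Integer> userFrequency,
                           int numUsers) {
    double dot = 0.0;
    double norm1 = 0.0;
    double norm2 = 0.0;
    for (Map.Entry<Long, Double> e : item1.entrySet()) {
      double weight = iuf(userFrequency.get(e.getKey()), numUsers);
      double v1 = weight * e.getValue();
      norm1 += v1 * v1;
      Double rating2 = item2.get(e.getKey());
      if (rating2 != null) {
        dot += v1 * (weight * rating2);
      }
    }
    for (Map.Entry<Long, Double> e : item2.entrySet()) {
      double v2 = iuf(userFrequency.get(e.getKey()), numUsers) * e.getValue();
      norm2 += v2 * v2;
    }
    return dot / (Math.sqrt(norm1) * Math.sqrt(norm2));
  }
}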
> 2. iuf is the same as in 1, and the score is just the dot product
I think it's problematic to not normalize the dot product. I wouldn't
expect this to work well, but let's see your results below 
> 3. iuf = N / (1 + uf), and the score is just the dot product
Valid, but I do think the log() is useful to keep here.
> As a result, I think it would be good if we found a way to smooth the item
> vectors to make the 1st way yield better results, for example by filling in
> some missing data. And it would be good if we had a way to merge the 2nd and
> 3rd ways; I mean, another similarity measure that finds hidden, hard-to-find
> similar items like the 2nd way, and also gives results like the 3rd way.
"Filling missing data"  yes this is what the "PreferenceInferrer"
does in the framework. I am not convinced it is very useful, but lets
you synthesize missing data.
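For what it's worth, wiring it in looks something like this, if memory serves
(it hangs off the user-similarity side; AveragingPreferenceInferrer is the
stock implementation, "data.csv" is just a placeholder, and treat the exact
signatures as approximate):

import java.io.File;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.similarity.AveragingPreferenceInferrer;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class InferrerExample {

  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("data.csv"));  // placeholder path
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    // estimates a preference where one is missing, so more pairs overlap
    similarity.setPreferenceInferrer(new AveragingPreferenceInferrer(model));
  }
}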
I also suspect 1 is the way to go. It is definitely the conventional
approach and the one already implemented by the framework.
