# mahout-user mailing list archives

From Pat Ferrel <pat.fer...@gmail.com>
Subject Re: LogLikelihoodSimilarity calculation
Date Sun, 14 Sep 2014 14:49:45 GMT
```The key phrase is frequency. If we use the search engine method for returning recs, we are
using LLR to find significant cooccurrences, as a filter. Then, in the final step of indexing
and querying for recs, they are (or can be) TF-IDF weighted and cosine similarity is used for
the query. This will down-weight universally popular items. If, for some reason, you want
to favor blockbusters, turn off norms and TF-IDF to leave the high weights for popular items.

On Sep 13, 2014, at 12:28 AM, mario.alemi@gmail.com wrote:

Hi All

One consideration. If we assume that all books have the same probability of
being bought, K11=90 has no significance for the recommendation, as
LLR rightly shows. The probability of having a certain K11 from
the point of view of User1 is binomial(K11, 100, 0.9), which has a max at
90. So the likelihood that the 90 books in common are there by chance is at
its maximum.

K11=10 is, on the contrary, significant. It says that users 1 and 2 are
*dissimilar*. The probability of user 2 picking only 10 books in common
with user 1 is actually quite low (binomially, 10 successes out of 100 trials
with p=90%, i.e. ~E-78).

LLR makes this clear too. Still, I wonder if it is possible in some way to
take into consideration that *not all books have the same probability of
being bought*. The Pareto nature of book selling means that if User2
buys 100 books randomly, according to their sales frequency, it would most
probably pick 100 best sellers (and not the long tail), and would end up with
a computable (but unknown to me) number of books in common with User1.
Therefore it is important to consider not just how many books User1 has in
common with User2, but also *which* books are in common. For two users with
100 books, having 10 blockbusters in common is not significant, while having
even 10 extremely rare books in common is quite significant. But this
would not come out of the computation of the LLR for two users. It does
come out when we compute the similarity between two books, but in
that case we do not consider the number of books bought by each of the
users who bought the two books.

(Ted, I am sorry to bring this topic up again, after the comment on your
blog, but every time I use the LLR -for item or user similarity- this
question always comes up in my mind, and I cannot see in the formulae how

Regards,
Mario

On Thu, Sep 11, 2014 at 5:43 AM, Ted Dunning <ted.dunning@gmail.com> wrote:

> It might help to look at the matrices that result:
>
> First I defined a function in R to generate the contingency tables:
>
>> f = function(k11, n1=100, n2=900000, n=1e6){matrix(c(k11, n1-k11, n2-k11, n-n1-n2+k11), nrow=2)}
>
> One of your examples is this one
>
>> f(90)
>     [,1]   [,2]
> [1,]   90 899910
> [2,]   10  99990
>
> Notice how the two columns are basically the same except for a scaling
> factor.
>
> Here is your other example
>
>> f(10)
>     [,1]   [,2]
> [1,]   10 899990
> [2,]   90  99910
>
> Now what we have is that in the first column, row 2 is bigger while in the
> second column, row 1 is bigger.  That is, the distributions are quite
> different.
>
> Here is the actual LLR score for the first example:
>
>> llr(f(90))
> [1] -1.275022e-10
>
> (the negative sign is spurious and the result of round-off error.  The real
> result is basically just 0)
>
> And for the second:
>
>> llr(f(10))
> [1] 351.6271
>
> Here we see a huge value which says that (as we saw), the distributions are
> different.
>
> For reference, here is the R code for llr:
>
>> llr
> function(k) {2 * sum(k) * (H(k) - H(rowSums(k)) - H(colSums(k)))}
>> H
> function(k) {N = sum(k) ; return (sum(k/N * log(k/N + (k==0))))}
>
>
> On Wed, Sep 10, 2014 at 2:32 PM, Aishwarya Srivastava <
> aishwarya.srivastava@bloomreach.com> wrote:
>
>> Hi Dmitriy,
>>
>> I am following the same calculation used in the userSimilarity method in
>> LogLikelihoodSimilarity.java
>>
>> k11 = intersectionSize       (both users viewed movie)
>>
>> k12 = prefs2Size - intersectionSize   (only viewed by user 2)
>>
>> k21 = prefs1Size - intersectionSize    (only viewed by user 1)
>>
>> k22 = numItems - prefs1Size - prefs2Size + intersectionSize  (viewed by neither 1 nor 2)
>>
>>
>> Thanks,
>>
>> Aishwarya
>>
>> On Wed, Sep 10, 2014 at 2:25 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>> wrote:
>>
>>> how do you compute k11, k12... values exactly?
>>>
>>> On Wed, Sep 10, 2014 at 1:55 PM, aishsesh <
>>> aishwarya.srivastava@bloomreach.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have the following case where numItems = 1,000,000, prefs1Size = 900,000
>>>> and prefs2Size = 100.
>>>>
>>>> This is the case when I have two users, one who has seen 90% of the movies in
>>>> the database and another who has seen only 100 of the million movies. Suppose they
>>>> have 90 movies in common (user 2 has seen only 100 movies in total). I would
>>>> expect the similarity to be higher than when they have only 10 movies
>>>> in common. But the similarities I am getting are
>>>> 0.9971 for intersection size 10 and
>>>> 0 for intersection size 90.
>>>>
>>>> This seems counterintuitive.
>>>>
>>>> Am I missing something? Is there an explanation for the above-mentioned
>>>> values?
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://lucene.472066.n3.nabble.com/LogLikelihoodSimilarity-calculation-tp4158035.html
>>>> Sent from the Mahout User List mailing list archive at Nabble.com.

```
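
For readers who want to reproduce the numbers in this thread without R, here is a minimal Python sketch of the same calculation. It ports the `f`, `H` and `llr` functions quoted above. The `similarity` helper is an assumption: it maps the score with `1 - 1/(1 + llr)`, which is consistent with the 0.9971 and 0 values reported in the original question, but it is not copied from the Mahout source. The names `contingency`, `entropy_term`, `similarity` and `binom_pmf` are illustrative only.

```python
import math


def contingency(k11, n1=100, n2=900_000, n=1_000_000):
    """2x2 table with the same cell layout as f() in the thread."""
    return [[k11, n2 - k11],
            [n1 - k11, n - n1 - n2 + k11]]


def entropy_term(counts):
    """Sum of p * log(p) over non-zero cells (the H() from the thread)."""
    total = sum(counts)
    return sum((c / total) * math.log(c / total) for c in counts if c > 0)


def llr(table):
    """Log-likelihood ratio (G^2) of a 2x2 contingency table."""
    cells = [c for row in table for c in row]
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    return 2.0 * sum(cells) * (
        entropy_term(cells) - entropy_term(row_sums) - entropy_term(col_sums)
    )


def similarity(score):
    """Assumed mapping of an LLR score into [0, 1); round-off negatives clamp to 0."""
    return 1.0 - 1.0 / (1.0 + max(0.0, score))


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


if __name__ == "__main__":
    for k11 in (90, 10):
        score = llr(contingency(k11))
        print(k11, score, similarity(score))
    # Mario's check: 10 overlaps out of 100 picks when each pick hits with p = 0.9
    print(binom_pmf(10, 100, 0.9))
```

With an overlap of 90 the table rows are proportional, so the score is zero up to round-off. With an overlap of 10 the score should match the roughly 351.6 that Ted reports. The binomial line checks the order of magnitude Mario quotes for 10 overlaps out of 100.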