mahout-user mailing list archives

From mario.al...@gmail.com
Subject Re: LogLikelihoodSimilarity calculation
Date Fri, 19 Sep 2014 10:29:31 GMT
Hi All

Pat and Ted, thank you very much for your answers! As usual, much
appreciated.

I see that the more I write, the more useless statements I make -such as
"[User] would most probably take 100 best sellers...". Ted, you are
obviously right, it's not the case.

Regarding the second point, it is not that I think that high overlap causes
problems. I would just like to compute the likelihood that a certain
overlap is obtained by chance, rather than because the two users are
similar or dissimilar. This was my concern when computing similarities
between crowd-based book lists. Some lists had only very few books, and
some of those lists had books in common: they were the "intellectual" lists
with rare, highly informative books, so even a few books in common were
significant. Other lists, with hundreds of books (e.g. the "free ebooks"
list), had the same books in common, but in that case the similarity weight
given by those same "rare" books had to be lower, because the lists were
different -they were "pop" lists.

So my question was -shouldn't we consider both the frequency distribution
of item sales *and* of user purchases in the same formula? Am I right in
saying that this does not happen when we compute the contingency table (if
we build the contingency table for two users, we do not consider the
frequency distribution of book sales, and vice versa)?
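
For concreteness, here is a minimal R sketch of what I mean (the numbers
are taken from Aishwarya's example further down; the variable names are
only illustrative). The user-user contingency table is built from four
counts alone, so the per-book sales frequencies never enter it:

numItems   <- 1e6      # catalogue size
prefs1Size <- 900000   # books of user 1
prefs2Size <- 100      # books of user 2
k11 <- 90                                          # books in common
k12 <- prefs2Size - k11                            # only user 2: 10
k21 <- prefs1Size - k11                            # only user 1: 899910
k22 <- numItems - prefs1Size - prefs2Size + k11    # neither: 99990
matrix(c(k11, k21, k12, k22), nrow=2)
# Nothing in these four cells records *which* books overlap.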

That said, I am fully aware that mine is a mainly academic question, as the
LLR does a fantastic job anyway!

Thanks again for your time (and for doing such a great job bringing Mahout
to Spark :) )
Mario



On Sun, Sep 14, 2014 at 8:01 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> Mario,
>
> Your questions are good.  And the answers, such as they are, bear repeating
> and elaboration.
>
> I see several basic points in what you write.  Selecting from this, I would
> highlight the following:
>
> 1) random-walk users picking items according to overall frequency will tend
> to pick the same very popular items
>
> 2) high overlap of this kind will cause problems with recommendation
>
>
> To the first point, the probability that an undirected user picks only
> common items is actually quite low.  The key property of a long-tailed
> frequency distribution is exactly that the long tail has very significant
> probability mass.  The high-frequency head items have significant mass as
> well, but they do not dominate.  Any user selecting 100 items will
> necessarily pick some high-frequency items, some mid-frequency items and
> some rare (i.e. long-tail) items.
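>
> As a quick illustration, here is a minimal R sketch (an arbitrary
> Zipf-like catalogue, not Mahout code; the exponent and the head cutoff
> are assumptions chosen only for illustration):
>
> n.items <- 1e6
> p <- 1 / (1:n.items)   # Zipf-like weights with exponent 1
> p <- p / sum(p)
> picks <- sample(n.items, size=100, replace=TRUE, prob=p)
> table(picks <= 1000)   # picks in the head (top 1000) vs. the long tail
> # With these weights the top 1000 items hold only about half of the
> # probability mass, so a 100-item sample mixes head, mid and tail items.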
>
> With respect to the second point, I would point out that the premise is
> flawed, so the question is already addressed.  But even if we consider it
> further, there are two issues with the assertion.  The first and most
> significant is the unstated assumption that overlap between user histories
> is what we are looking for.  In fact, with LLR, we are not doing that at
> all.  We are looking at overlap of users in the item history, corrected
> according to underlying frequency.  This means that even if 80 or 90% of
> the items selected by our random-walk users are from the same small set of
> high-frequency items (and they will not be), we still don't really have a
> problem.  It just means that we will be spending too many cycles on the
> analysis of things we could find out quickly.
>
> The second issue with this is that even for the high-frequency items, if
> they have no tendency toward cooccurrence beyond what is expected from
> their high underlying frequency, the LLR score will not be unusually high.
> Now, high-frequency items often do have small correlations in their
> occurrences, and since they are abnormally sensitive by virtue of their
> high frequency, this can lead to a few items being commonly marked as
> indicators for others.
>
> This is also not much of a problem because the search engine will
> down-weight such common indicators.
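>
> As a sketch of that frequency correction (the counts here are invented
> for illustration, using the llr function defined further down in this
> thread): two items that each occur in 100,000 of 1,000,000 histories, and
> cooccur in exactly the 10,000 histories that independence predicts, get a
> score near zero; push the cooccurrence above chance and the score blows
> up.
>
> llr(matrix(c(10000, 90000, 90000, 810000), nrow=2))  # at chance: ~0
> llr(matrix(c(20000, 80000, 80000, 820000), nrow=2))  # 2x chance: large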
>
> Does this help?  I know that I cherry-picked those of your questions that I
> have the strongest answers for, but it seems to me that they are also the
> most fundamental ones.
>
>
>
>
> On Sat, Sep 13, 2014 at 12:28 AM, <mario.alemi@gmail.com> wrote:
>
> > Hi All
> >
> > One consideration. If we assume that all books have the same probability
> > of being bought, K11=90 has no significance for the recommendation, as
> > LLR rightly shows. From the point of view of User1, the probability of
> > observing a certain K11 is Binomial(K11; 100, 0.9), which has its
> > maximum at K11=90. So the likelihood that the 90 books in common are
> > there by chance is at its maximum.
> >
> > K11=10 is, on the contrary, significant. It says that users 1 and 2 are
> > *dissimilar*. The probability of user 2 picking only 10 books in common
> > with user 1 is actually quite low (binomially, 10 successes out of 100
> > trials with p=90%, i.e. ~E-78).
> >
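> > A quick check of those numbers in R (just the binomial probabilities
> > mentioned above, nothing Mahout-specific):
> >
> > dbinom(90, size=100, prob=0.9)   # most likely overlap, ~0.13
> > dbinom(10, size=100, prob=0.9)   # ~6e-78, vanishingly small
> >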
> > LLR makes this clear too. Still, I wonder if it is possible in some way
> > to take into consideration that *not all books have the same probability
> > of being bought*. The Pareto nature of book selling means that if User2
> > bought 100 books randomly, according to their sales frequency, he would
> > most probably take 100 best sellers -and not the long tail- and would
> > end up with a computable (but unknown to me) number of books in common
> > with User1. Therefore it is important to consider not just *how many*
> > books User1 has in common with User2, but also *which* books are in
> > common. For two users with 100 books each, having 10 blockbusters in
> > common is not significant; having even 10 extremely rare books in common
> > is quite significant. But this does not come out of the computation of
> > the LLR for two users. It does come out when we compute the similarity
> > between two books -but then, in this case, we do not consider the number
> > of books bought by each of the users who bought the two books.
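> >
> > To put rough numbers on that intuition (the probabilities are purely
> > illustrative, not from any real catalogue): if every user buys a given
> > blockbuster with probability 0.3 and a given rare book with probability
> > 0.001, then for two independent users
> >
> > 0.3^2     # both bought the same blockbuster: 0.09, easily by chance
> > 0.001^2   # both bought the same rare book: 1e-6, hardly by chance
> >
> > so a shared rare book is much stronger evidence of similarity than a
> > shared blockbuster.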
> >
> > (Ted, I am sorry to bring this topic up again, after the comment on your
> > blog, but every time I use the LLR -for item or user similarity- this
> > question comes to my mind, and I cannot see in the formulae how it is
> > addressed)
> >
> > Regards,
> > Mario
> >
> >
> >
> > On Thu, Sep 11, 2014 at 5:43 AM, Ted Dunning <ted.dunning@gmail.com>
> > wrote:
> >
> > > It might help to look at the matrices that result:
> > >
> > > First I defined a function in R to generate the contingency tables:
> > >
> > > > f = function(k11, n1=100, n2=900000, n=1e6) {
> > > +     matrix(c(k11, n1 - k11, n2 - k11, n - n1 - n2 + k11), nrow=2)
> > > + }
> > >
> > > One of your examples is this one
> > >
> > > > f(90)
> > >      [,1]   [,2]
> > > [1,]   90 899910
> > > [2,]   10  99990
> > >
> > > Notice how the two columns are basically the same except for a scaling
> > > factor.
> > >
> > > Here is your other example
> > >
> > > > f(10)
> > >      [,1]   [,2]
> > > [1,]   10 899990
> > > [2,]   90  99910
> > >
> > > Now what we have is that in the first column, row 2 is bigger, while in
> > > the second column, row 1 is bigger.  That is, the distributions are
> > > quite different.
> > >
> > > Here is the actual LLR score for the first example:
> > >
> > > > llr(f(90))
> > > [1] -1.275022e-10
> > >
> > > (the negative sign is spurious, the result of round-off error.  The
> > > real result is basically just 0)
> > >
> > > And for the second:
> > >
> > > > llr(f(10))
> > > [1] 351.6271
> > >
> > > Here we see a huge value, which says that (as we saw) the distributions
> > > are different.
> > >
> > > For reference, here is the R code for llr:
> > >
> > > > llr
> > > function(k) {2 * sum(k) * (H(k) - H(rowSums(k)) - H(colSums(k)))}
> > > > H
> > > function(k) {N = sum(k); return(sum(k/N * log(k/N + (k==0))))}
> > >
> > > (H here is the negative Shannon entropy of the table normalized to
> > > probabilities, with the (k==0) term guarding against log(0); so llr is
> > > 2 * N times the mutual information between rows and columns.)
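> > >
> > > For completeness: if I read LogLikelihoodSimilarity.java correctly, the
> > > raw LLR is then mapped onto [0, 1) as similarity = 1 - 1/(1 + LLR),
> > > which reproduces the numbers reported at the start of this thread:
> > >
> > > 1 - 1/(1 + llr(f(10)))   # ~0.997 for intersection size 10
> > > 1 - 1/(1 + llr(f(90)))   # ~0 for intersection size 90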
> > >
> > >
> > > On Wed, Sep 10, 2014 at 2:32 PM, Aishwarya Srivastava <
> > > aishwarya.srivastava@bloomreach.com> wrote:
> > >
> > > > Hi Dmitriy,
> > > >
> > > > I am following the same calculation used in the userSimilarity
> > > > method in LogLikelihoodSimilarity.java:
> > > >
> > > > k11 = intersectionSize                                      (viewed by both users)
> > > >
> > > > k12 = prefs2Size - intersectionSize                         (viewed only by user 2)
> > > >
> > > > k21 = prefs1Size - intersectionSize                         (viewed only by user 1)
> > > >
> > > > k22 = numItems - prefs1Size - prefs2Size + intersectionSize (viewed by neither user)
> > > >
> > > >
> > > > Thanks,
> > > >
> > > > Aishwarya
> > > >
> > > > On Wed, Sep 10, 2014 at 2:25 PM, Dmitriy Lyubimov
> > > > <dlieu.7@gmail.com> wrote:
> > > >
> > > > > how do you compute k11, k12... values exactly?
> > > > >
> > > > > On Wed, Sep 10, 2014 at 1:55 PM, aishsesh <
> > > > > aishwarya.srivastava@bloomreach.com> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I have the following case where numItems = 1,000,000,
> > > > > > prefs1Size = 900,000 and prefs2Size = 100.
> > > > > >
> > > > > > It is the case where I have two users, one who has seen 90% of the
> > > > > > movies in the database and another only 100 of the million movies.
> > > > > > Suppose they have 90 movies in common (user 2 has seen only 100
> > > > > > movies in total); I would assume the similarity to be high compared
> > > > > > to when they have only 10 movies in common. But the similarities I
> > > > > > am getting are
> > > > > > 0.9971 for intersection size 10 and
> > > > > > 0 for intersection size 90.
> > > > > >
> > > > > > This seems counterintuitive.
> > > > > >
> > > > > > Am I missing something? Is there an explanation for the
> > > > > > above-mentioned values?
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > View this message in context:
> > > > > > http://lucene.472066.n3.nabble.com/LogLikelihoodSimilarity-calculation-tp4158035.html
> > > > > > Sent from the Mahout User List mailing list archive at Nabble.com.
> > > > >
> > > >
> > >
> >
>
