mahout-user mailing list archives

From: Ted Dunning <ted.dunn...@gmail.com>
Subject: Re: [Taste] Sanity Check and Questions
Date: Fri, 19 Jun 2009 16:54:16 GMT
If you have a history corpus of users and items represented as a binary
matrix A, then A' A is the cooccurrence of items within user histories.

You can construct a 2x2 contingency table for each non-zero element of K =
A' A and select only the item pairs that have large scores on the LLR
statistic.  The contingency table for item i versus item j is formed using
the cooccurrence of i and j (k_ij), the number of occurrences of i without j
(rowsum(K)_i - k_ij), the number of occurrences of j without i (colsum(K)_j
- k_ij) and the number of non-i, non-j cooccurrences (sum(K) - rowsum(K)_i -
colsum(K)_j + k_ij).  Of course K is symmetric, so the row and column sums
are interchangeable.
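
As a rough Java sketch of those counts and the score (the entropy-style LLR
formulation follows the blog post linked below; names like k, rowSum, colSum
and total are illustrative, not any particular Mahout API):

    // Entropy-style helper for the LLR: sum of (c/n) * log(c/n)
    // over the given counts, skipping zeros.
    static double h(long... counts) {
        long n = 0;
        for (long c : counts) n += c;
        double sum = 0.0;
        for (long c : counts) {
            if (c > 0) {
                double p = (double) c / n;
                sum += p * Math.log(p);
            }
        }
        return sum;
    }

    // LLR (G^2) score for a 2x2 contingency table.
    static double llr(long k11, long k12, long k21, long k22) {
        long n = k11 + k12 + k21 + k22;
        return 2.0 * n * (h(k11, k12, k21, k22)
                - h(k11 + k12, k21 + k22)      // row sums
                - h(k11 + k21, k12 + k22));    // column sums
    }

    // Scoring item i versus item j from K and its marginals:
    //   double score = llr(k[i][j],
    //       rowSum[i] - k[i][j],                      // i without j
    //       colSum[j] - k[i][j],                      // j without i
    //       total - rowSum[i] - colSum[j] + k[i][j]); // neither i nor j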

Let K* be the filtered version of K that only contains the non-zero elements
where we found a large score for the contingency table.  Each row i of K* is
a recommendation set for item i.  Recommendation sets can be combined by
summing weighted rows of K*, as in the sketch below.
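
Concretely, that combination might look like this (a sketch assuming K* is
held as a dense double[][] and h holds per-item weights taken from the
user's history; both names are illustrative):

    // Combine recommendation sets by summing weighted rows of K*.
    static double[] recommend(double[][] kStar, double[] h) {
        int nItems = kStar.length;
        double[] r = new double[nItems];
        for (int i = 0; i < nItems; i++) {
            if (h[i] == 0) continue;            // only rows for history items
            for (int j = 0; j < nItems; j++) {
                r[j] += h[i] * kStar[i][j];     // weighted row sum
            }
        }
        return r;                               // score per candidate item
    }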

Another way to look at this is that retrieving a set of related users for a
user with history h is very similar to a matrix-vector multiplication:

             z = related users = A h

Then finding the items that these users relate to is another multiply:

            r = recommendation = A' (A h)
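
Written directly against binary histories, that two-multiply view might look
like this (a sketch; each history is the set of item ids in one row of A,
and the names are illustrative):

    // r = A' (A h): first count each user's overlap with h (z = A h),
    // then add that overlap to every item in the user's history (r = A' z).
    static double[] recommendTwoPass(int[][] histories, int[] h, int nItems) {
        boolean[] inH = new boolean[nItems];
        for (int i : h) inH[i] = true;
        double[] r = new double[nItems];
        for (int[] items : histories) {
            int overlap = 0;                    // z_u for this user
            for (int i : items) if (inH[i]) overlap++;
            if (overlap == 0) continue;
            for (int i : items) r[i] += overlap;
        }
        return r;
    }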

In this sort of implementation, K will need to be appropriately weighted to
avoid disaster, and you will probably need some additional sparsification to
make it at all reasonable.  The cooccurrence approach is nearly equivalent
to noting that

            A' (A h) = (A' A) h

where A' A can be computed off-line.  To make this work well, as I
mentioned, it is typical to work with a weighted version of A.  The use of
an un-weighted version that is sparsified by the LLR test has some real
benefits in that the unweighted computation of A' A involves only counting
binary values, which can be made pretty darned fast.
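
That counting can be done one user at a time, roughly like this (a sketch
using the same illustrative representation of histories as above; item ids
are assumed to lie in [0, nItems)):

    // Unweighted K = A'A by counting: every pair of items that appears
    // together in one user's history bumps the corresponding cell of K.
    static long[][] cooccurrence(int[][] histories, int nItems) {
        long[][] k = new long[nItems][nItems];
        for (int[] items : histories) {
            for (int i : items) {
                for (int j : items) {
                    k[i][j]++;          // diagonal k_ii counts occurrences of i
                }
            }
        }
        return k;
    }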

In my experience, negative ratings are highly misleading since people tend
not to down-rate things that they don't like, but rather only down-rate
things that are very close to the things that they do like but about which
they have some negative animus.  Imagine, for example, a universe consisting
of Guernsey, Jersey, Holstein and Angus cows and various Intel, AMD and ARM
CPU chips.  Suppose that we have two users: a geek who hates AMD and has
made one negative rating on AMD and no positive ratings, and an Amish farmer
who cannot abide Angus cows and has rated them negatively.  Recommending
cows to the computer geek is worse than recommending AMD articles, even
though he might have marked AMD items negatively in the past.  Likewise,
sending the traditional Amish farmer announcements about the latest
low-power CPUs is probably not as useful as sending him articles about Angus
cows.

Besides, negative ratings are much rarer than positive ratings.  That said,
you can handle negative ratings by using two binary matrices A+ and A-
(imagine the + and - are superscripted) that represent the positive and
negative ratings respectively.  Each of these can be analyzed for
cooccurrence as described above, and you get a positive recommendation and a
negative recommendation, as in the sketch below.
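
Reusing the illustrative helpers above, that amounts to running the same
pipeline twice (how, or whether, to combine the two resulting vectors is a
separate decision, so this sketch just produces both):

    // Two separate cooccurrence analyses: one over positive histories,
    // one over negative histories (aPlus / aMinus use the same int[][]
    // representation as before).
    long[][] kPlus  = cooccurrence(aPlus, nItems);
    long[][] kMinus = cooccurrence(aMinus, nItems);
    // ... LLR-filter each to get kStarPlus / kStarMinus, then:
    double[] positiveRec = recommend(kStarPlus, hPlus);
    double[] negativeRec = recommend(kStarMinus, hMinus);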

See http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html for
more details about the computations.

> What is the co-occurrence approach Ted?




-- 
Ted Dunning, CTO
DeepDyve
