Hi Ted,
I think it would be interesting to have such code in Mahout.
The first part of it would be to compute the matrix of cooccurrence
counts and the next part would be to compute the log-likelihood scores.
Should we open a JIRA where you can provide a worked example, and we
can take it from there?
Thanks
Ankur
Original Message
From: Ted Dunning [mailto:ted.dunning@gmail.com]
Sent: Tuesday, January 20, 2009 12:16 PM
To: mahoutdev@lucene.apache.org
Subject: Re: RE: RE: [jira] Commented: (MAHOUT19) Hierarchial clusterer
I can supply code for computing the measure itself, but not for the
map-reduce computation of the counts involved.
In my experience, this only requires about 10-15 lines of Pig but a rather
larger amount of native map-reduce code. At Veoh, we used this and other
mechanisms to reduce very large amounts of data (7 months at billions of
events per month) into a form usable for recommendation. Even with a
relatively small cluster, this is not an extremely long computation.
The four inputs to the log-likelihood ratio test for independence are all
counts. For item A and item B, the necessary counts are the number of
users who interacted with both item A and item B, the number of users who
interacted with A but not B, the number who interacted with B but not A, and
the number of users who interacted with neither item. To minimize issues with
click spam, it is customary to count only one interaction per user, so all of
the counts can be considered counts of users rather than events.
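To make the four counts concrete, here is a minimal sketch of the test
statistic in Python. It uses the standard G^2 form of the log-likelihood
ratio for a 2x2 table rather than any particular Mahout code; the function
name llr and the argument names k11, k12, k21, k22 are my own labels for
both, A-not-B, B-not-A and neither.

```python
import math


def llr(k11, k12, k21, k22):
    """G^2 log-likelihood ratio for a 2x2 table of user counts.

    k11: users who interacted with both A and B
    k12: users who interacted with A but not B
    k21: users who interacted with B but not A
    k22: users who interacted with neither
    (Argument names are illustrative, not taken from the original thread.)
    """
    n = k11 + k12 + k21 + k22
    if n == 0:
        return 0.0
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, k12), (k21, k22))
    total = 0.0
    for i in range(2):
        for j in range(2):
            k = cells[i][j]
            # A zero cell contributes nothing (limit of k * log k as k -> 0).
            if k > 0:
                total += k * math.log(k * n / (rows[i] * cols[j]))
    return 2.0 * total
```

A table whose rows and columns are exactly independent scores 0, and the
score grows as the observed counts move away from independence.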
If you view your set of histories as a binary matrix H containing rows
that correspond to users and columns that correspond to items, then H' H is
the matrix of cooccurrence counts for all possible A's and B's. Columns of
H' H provide the information needed to get the A-not-B and B-not-A counts, and
the total of the matrix gives the information for the not-A-not-B counts.
This matrix multiplication is, in fact, the same as a join.
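As a small in-memory illustration of the counting step (not the map-reduce
or Pig version), the sketch below pairs up the items within each user's
history, which is the self-join on user that H' H expresses. It takes the
per-item user counts (the diagonal of H' H) and the total number of users
as the extra inputs for the remaining three cells. The function names and
the dict-of-lists input format are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(histories):
    """histories: dict mapping user -> iterable of (sortable) item ids.

    Returns (number of users, users per item, users per unordered item pair).
    Each user is counted at most once per item, as suggested above.
    """
    item_users = Counter()
    pair_users = Counter()
    for items in histories.values():
        unique = sorted(set(items))  # one interaction per user per item
        item_users.update(unique)
        # Pairing items within one user's history is the join on user.
        pair_users.update(combinations(unique, 2))
    return len(histories), item_users, pair_users


def contingency(n_users, item_users, pair_users, a, b):
    """Return (both, A-not-B, B-not-A, neither) for items a and b."""
    both = pair_users[tuple(sorted((a, b)))]
    a_not_b = item_users[a] - both
    b_not_a = item_users[b] - both
    neither = n_users - both - a_not_b - b_not_a
    return both, a_not_b, b_not_a, neither
```

For example, with four users whose histories are [x, y], [x], [y, x, x]
and [z], the counts for the pair (x, y) come out as both = 2, A-not-B = 1,
B-not-A = 0 and neither = 1.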
I have a blog posting on the subject of computing log-likelihood ratios
here:
http://tdunning.blogspot.com/2008/03/surpriseandcoincidence.html
If need be, I can add a worked example of how to compute cooccurrence using
map-reduce.
On Mon, Jan 19, 2009 at 9:30 PM, Goel, Ankur
<ankur.goel@corp.aol.com>wrote:
> About Tanimoto measure, I thought of using it in hierarchical clustering
> but Ted suggested it might not solve the purpose. He suggested that we
> can try computing the log-likelihood of cooccurrence of items.
>
> I would like to try out both the item based recommender you suggested
> and also the log-likelihood approach. Do we have the mapred version of
> log-likelihood code in Mahout?
>
> Ted, any thoughts?
>

Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
6503240110, ext. 738
8584140013 (m)
