mahout-dev mailing list archives

From "Goel, Ankur" <ankur.g...@corp.aol.com>
Subject RE: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer
Date Tue, 20 Jan 2009 10:27:24 GMT
Hi Ted,
        I think it would be interesting to have such code in Mahout.
The first part would be to compute the matrix of co-occurrence counts,
and the next part would be to compute the log-likelihood scores.

Should we open a JIRA wherein you can provide a worked example and we
can take it from there?

Thanks
-Ankur

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Tuesday, January 20, 2009 12:16 PM
To: mahout-dev@lucene.apache.org
Subject: Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

I can supply code for computing the measure itself, but not for the
map-reduce computation of the counts involved.

In my experience, this only requires about 10-15 lines of Pig but a
rather larger amount of native map-reduce code.  At Veoh, we used this
and other mechanisms to reduce very large amounts of data (7 months at
billions of events per month) into a form usable for recommendation.
Even with a relatively small cluster, this is not an extremely long
computation.

The four inputs to the log-likelihood ratio test for independence are all
counts.  For item A and item B, the necessary counts are the number of
users who interacted with both item A and item B, the number of users who
interacted with A but not B, the number who interacted with B but not A,
and the number of users who interacted with neither item.  To minimize
issues with click spam it is customary to count only one interaction per
user, so all of the counts can be considered counts of users rather than
events.
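
As a minimal sketch (not code from this thread), the test on those four
counts can be written as the standard G-statistic for a 2x2 contingency
table: twice the sum over cells of observed * ln(observed / expected).
The function name llr and the argument names k11, k12, k21, k22 are
hypothetical labels for the both, A-not-B, B-not-A and neither counts.

```python
import math

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G-statistic) for a 2x2 table of user counts.

    k11: users who interacted with both A and B
    k12: users who interacted with A but not B
    k21: users who interacted with B but not A
    k22: users who interacted with neither
    (Names are illustrative, not taken from the original message.)
    """
    n = k11 + k12 + k21 + k22
    if n == 0:
        return 0.0
    observed = ((k11, k12), (k21, k22))
    row_totals = (k11 + k12, k21 + k22)
    col_totals = (k11 + k21, k12 + k22)
    g = 0.0
    for i in range(2):
        for j in range(2):
            k = observed[i][j]
            if k > 0:  # a zero cell contributes nothing to the sum
                expected = row_totals[i] * col_totals[j] / n
                g += k * math.log(k / expected)
    return 2.0 * g
```

A score near zero means the co-occurrence of A and B is about what
independence would predict; larger scores indicate a stronger departure
from independence.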

If you view your set of histories as a binary matrix H containing rows
that correspond to users and columns that correspond to items, then H' H
is the matrix of co-occurrence counts for all possible A's and B's.
Columns of H' H provide information needed to get the A-not-B and B-not-A
counts, and the total of the matrix gives the information for the
not-A-not-B counts.
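
A small sketch of this, in plain Python, may help.  It assumes H is a
list of 0/1 rows (one per user) and makes one particular choice that is
an assumption of the sketch rather than something stated above: the
per-item user totals are read from the diagonal of H' H, and the total
number of users is taken as the number of rows of H.  The function names
are hypothetical.

```python
def cooccurrence(H):
    """Compute C = H' H for a binary user-by-item matrix H (list of rows)."""
    n_items = len(H[0]) if H else 0
    C = [[0] * n_items for _ in range(n_items)]
    for row in H:
        present = [j for j, v in enumerate(row) if v]
        # Each user adds 1 to every (a, b) pair of items they touched.
        for a in present:
            for b in present:
                C[a][b] += 1
    return C

def contingency(C, n_users, a, b):
    """Derive the four counts for items a and b.

    Assumption of this sketch: C[a][a] is the number of users who
    interacted with item a, and n_users is the number of rows of H.
    """
    both = C[a][b]
    a_not_b = C[a][a] - both
    b_not_a = C[b][b] - both
    neither = n_users - C[a][a] - C[b][b] + both
    return both, a_not_b, b_not_a, neither

# Four users, three items (toy data invented for illustration).
H = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 0],
]
C = cooccurrence(H)
counts_01 = contingency(C, len(H), 0, 1)
counts_02 = contingency(C, len(H), 0, 2)
```

The four numbers returned for a pair are exactly the inputs the
log-likelihood ratio test needs.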

This matrix multiplication is, in fact, the same as a join.
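
To illustrate the join view, here is a hedged sketch that starts from raw
(user, item) events instead of a matrix.  It groups events by user,
de-duplicates so each user counts once per item, and then pairs every
item with every other item seen by the same user, which is a self-join on
the user key followed by a count per item pair.  The function name and
the event layout are assumptions made for the example.

```python
from collections import defaultdict
from itertools import product

def cooccurrence_by_join(events):
    """Count users per (item_a, item_b) pair via a self-join on user.

    events: iterable of (user, item) tuples; repeats are allowed and
    are collapsed so each user contributes at most once per item.
    """
    items_by_user = defaultdict(set)
    for user, item in events:
        items_by_user[user].add(item)  # set collapses repeated events

    counts = defaultdict(int)
    for items in items_by_user.values():
        # Self-join: every pair of items sharing the same user key.
        for a, b in product(items, repeat=2):
            counts[(a, b)] += 1
    return dict(counts)

# Toy events invented for illustration; u0 clicks A twice.
events = [
    ("u0", "A"), ("u0", "B"), ("u0", "A"),
    ("u1", "A"),
    ("u2", "B"), ("u2", "C"),
]
pair_counts = cooccurrence_by_join(events)
```

The resulting dictionary holds the non-zero entries of H' H keyed by item
pair, so the same counts come out of either formulation.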

I have a blog posting on the subject of computing log-likelihood ratios
here:

 http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html

If need be, I can add a worked example of how to compute co-occurrence
using map-reduce.


On Mon, Jan 19, 2009 at 9:30 PM, Goel, Ankur <ankur.goel@corp.aol.com> wrote:

> About the Tanimoto measure, I thought of using it in hierarchical
> clustering, but Ted suggested it might not serve the purpose. He
> suggested that we can try computing the log-likelihood of co-occurrence
> of items.
>
> I would like to try out both the item based recommender you suggested
> and also the log-likelihood approach. Do we have the map-red version of
> log-likelihood code in Mahout?
>
> Ted, any thoughts?
>



-- 
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)
