mahout-user mailing list archives

From Grant Ingersoll <>
Subject Email and Collab. Filtering
Date Mon, 22 Aug 2011 14:48:28 GMT
I'm working on an example (well, examples) of using Mahout with the ASF Public Data Set up
on Amazon ( and I wanted to show how to use
the 3 "C's" (collab filtering, clustering, classification) with the data set.  Clustering
and classification are pretty straightforward, but I'm wondering about the setup around collaborative filtering.

The motivation for recommendations is pretty straightforward: provide people recs on emails
that they might find useful based on what other people have interacted with.  The tricky part
is that I am not totally sure what a valid setup of the problem looks like.  My current thinking is
that I build up the rec. matrix based on whether someone has interacted with (initiated/replied to)
a thread or not.  Thus, the columns are the thread ids and the rows are the users.  Each cell contains
the number of times user X has interacted with thread Y.  This feels to me like
a stand-in for that user's preference, in that if they are replying multiple times, they
have an interest in that topic.  I have no idea if this will be effective or not, but it seems
like it could be interesting.  Does it sound reasonable?  I worry that even in a really large
data set like the one above, the matrix will simply be too sparse.
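
To make the proposed setup concrete, here is a minimal sketch of the user-by-thread count matrix and a quick sparsity check. It is only an illustration under stated assumptions: the function names, the `(user, thread_id)` event format, and the sample data are all hypothetical, and nothing here uses Mahout itself or the actual ASF data set.

```python
from collections import Counter


def build_interaction_counts(events):
    """Count how many times each user posted in each thread.

    events: iterable of (user, thread_id) pairs, one per message a user
    sent to a thread (either the initial post or a reply).
    Returns a Counter keyed by (user, thread_id); absent keys are the
    zero cells of the matrix.
    """
    return Counter((user, thread) for user, thread in events)


def sparsity(counts):
    """Fraction of cells in the users x threads matrix that are empty."""
    users = {user for user, _ in counts}
    threads = {thread for _, thread in counts}
    total_cells = len(users) * len(threads)
    if total_cells == 0:
        return 0.0
    return 1.0 - len(counts) / total_cells


def to_triples(counts):
    """Flatten the matrix into sorted (user, thread_id, count) triples."""
    return sorted((user, thread, n) for (user, thread), n in counts.items())


# Hypothetical sample: alice posts twice in thread t1, the others once each.
events = [
    ("alice", "t1"), ("bob", "t1"), ("alice", "t1"),
    ("carol", "t2"), ("bob", "t3"),
]
counts = build_interaction_counts(events)
triples = to_triples(counts)

for user, thread, n in triples:
    print(f"{user},{thread},{n}")
print(f"sparsity: {sparsity(counts):.2f}")
```

Storing only the non-zero cells keeps memory proportional to the number of interactions rather than to users times threads, which matters if the matrix turns out as sparse as feared. Running the sparsity check on the real data before training anything would give a direct answer to that worry.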

Is there a better way to think about this from a strict collaborative filtering context? 
In other words, I know I could do content-based recommendations but that is not what I am
after here.


Grant Ingersoll
