mahout-dev mailing list archives

From Sean Owen
Subject Re: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer
Date Sat, 17 Jan 2009 13:24:51 GMT
At first this indeed sounds like a CF problem. You can use clustering to
solve a CF problem (see for instance TreeClusteringRecommender).

But you could use other algorithms just as well - see any of the other
Recommenders. You don't have ratings for a URL, just a binary 'yes, visited'
or nothing. You can take advantage of that by using the 'Boolean*' classes
and the Tanimoto similarity metric.
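
To make the Tanimoto idea concrete, here is a minimal sketch in Python. It is
not Mahout's implementation; the function name and the sample data are
hypothetical. It treats each user as the set of URLs visited and scores two
users by intersection size over union size.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient between two collections of items.

    Returns 0.0 when both are empty, which is a convention chosen here.
    """
    a, b = set(a), set(b)
    union = len(a | b)
    if union == 0:
        return 0.0
    return len(a & b) / union


# Hypothetical boolean preference data: user -> set of visited URLs.
visits = {
    "u1": {"/a", "/b", "/c"},
    "u2": {"/b", "/c", "/d"},
    "u3": {"/x"},
}

if __name__ == "__main__":
    print(tanimoto(visits["u1"], visits["u2"]))  # 2 shared out of 4 distinct
    print(tanimoto(visits["u1"], visits["u3"]))  # nothing shared
```

The same function works for item-item similarity if each URL is represented
by the set of users who visited it.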

This doesn't capture the fact that there is an ordering that is important -
URL A was clicked just before B so when I am on A we should recommend B (but
not necessarily the reverse). To capture this I think you want to try an
item-based recommender with an item-item similarity that captures this
relation. It won't be symmetric which messes up some other things - may need
more tweaking of existing code to get right.
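
One possible shape for such an asymmetric similarity, as a rough Python
sketch: score A -> B by the fraction of sessions containing A in which B is
visited at some point after A. The definition, the function name and the
session format are all assumptions made for illustration, not existing
Mahout code.

```python
from collections import defaultdict


def ordered_similarity(sessions):
    """Return sim[a][b]: the fraction of sessions containing a in which b
    appears somewhere after the first visit to a. Not symmetric."""
    contains = defaultdict(int)
    followed = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        # Position of the first visit to each URL in this session.
        first_seen = {}
        for i, url in enumerate(session):
            first_seen.setdefault(url, i)
        for a, first in first_seen.items():
            contains[a] += 1
            # Count each later URL once per session.
            for b in set(session[first + 1:]):
                if b != a:
                    followed[a][b] += 1
    return {a: {b: n / contains[a] for b, n in later.items()}
            for a, later in followed.items()}


if __name__ == "__main__":
    # Hypothetical click sessions, each an ordered list of URLs.
    sim = ordered_similarity([["A", "B", "C"], ["A", "B"], ["B", "A"]])
    print(sim["A"]["B"], sim["B"]["A"])  # the two directions differ
```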

But then again, is this a CF problem? It sounds like Markov chains: given the
last 1 or 2 or 3 URLs visited, which URL has come next most often? I think
that's relatively easy and fast. Does that work?
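
A minimal sketch of that counting approach in Python, assuming sessions are
ordered lists of URLs. The names `train` and `predict_next` are hypothetical,
and backing off from the longest context to shorter ones is one design
choice among several.

```python
from collections import Counter, defaultdict


def train(sessions, max_order=3):
    """Count, for every context of the last 1..max_order URLs,
    which URL was visited next."""
    counts = defaultdict(Counter)
    for session in sessions:
        for i in range(1, len(session)):
            for k in range(1, max_order + 1):
                if i - k < 0:
                    break
                context = tuple(session[i - k:i])
                counts[context][session[i]] += 1
    return counts


def predict_next(counts, recent, max_order=3):
    """Return the most frequent next URL for the longest context seen in
    training, backing off to shorter contexts; None if nothing matches."""
    for k in range(min(max_order, len(recent)), 0, -1):
        context = tuple(recent[-k:])
        if context in counts:
            return counts[context].most_common(1)[0][0]
    return None


if __name__ == "__main__":
    # Hypothetical sessions.
    sessions = [["A", "B", "C"], ["A", "B", "C"],
                ["X", "B", "D"], ["X", "B", "D"], ["Y", "B", "D"]]
    model = train(sessions)
    print(predict_next(model, ["A", "B"]))  # uses the two-URL context
    print(predict_next(model, ["Z", "B"]))  # unseen pair, backs off to ("B",)
```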

As for data I would indeed consider throwing out data you believe is just


On 16 Jan 2009, 12:25 PM, "Goel, Ankur" wrote:

Ted / Karl, Thank you both for your comments and suggestions. Continuing
on the comments from Ted...

The end goal is definitely not clustering but rather recommendations.
This can be broken down into 2 separate tasks typical of a
recommendation engine.
1. Given a URL show other URLs people have liked.
2. Given a User session and the URL he is seeing, suggest other URLs he
might like.

I experimented a bit with clustering but couldn't get good
From your advice, log-likelihood ratio sounds like a potential solution
for the first one. I remember having a discussion with you and Sean a long
time back where you pointed to a useful paper

Please pardon me if I am asking the question again but do you think it's
a promising approach for problem 1? Do we have an implementation for
this in Mahout? If no then I can open it and work on it (given my other
work commitments allow enough time). Also since I have no formal
statistics background, I am working on 'rebuilding' my statistics
knowledge so that I can grasp these concepts better.
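
For reference, the log-likelihood ratio for problem 1 can be computed from a
2x2 co-occurrence table per URL pair. The sketch below uses the standard G^2
form of the statistic in Python; the function name and the table layout are
assumptions for illustration, and it is not taken from Mahout.

```python
import math


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    Assumed layout for a URL pair (A, B):
      k11: users who visited both A and B
      k12: users who visited A but not B
      k21: users who visited B but not A
      k22: users who visited neither
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    total = 0.0
    for k, i, j in cells:
        if k > 0:  # empty cells contribute nothing
            total += k * math.log(k * n / (rows[i] * cols[j]))
    return 2.0 * total


if __name__ == "__main__":
    print(llr(10, 10, 10, 10))    # independent visits: score near zero
    print(llr(100, 5, 5, 1000))   # strongly associated pair: large score
```

A high score only says the co-occurrence is unlikely under independence, so
for "people who liked this also liked" one would rank the other URLs by this
score for a given URL and keep the top few.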

As for the data rates, I really don't know, in the context of these
techniques, what's low and what's high. What I have learnt after
accumulating weeks of data is that there are a few users who have good
engagement (sufficient clicks) over a period of time, a moderate number of
users who have a small number of clicks, and a large number of users who
have very few clicks and are just casual surfers.

Also regarding building a user model as a simple mixture, I am not sure
which one you are referring to. Is it the LDA JIRA that Jeff is working

Once again thanks for all the help, much appreciated.


-----Original Message----- From: Ted Dunning []
Sent: Thursday, Januar...
