mahout-user mailing list archives

From Sean Owen <sro...@gmail.com>
Subject Re: Compute similarities for an hudge quantity of data
Date Mon, 06 Jul 2009 23:42:47 GMT
Do you need to compute the similarity between all pairs of users in
order to measure similarity between any two users? No, not at all.
There are several implementations of UserSimilarity, and in general
they only look at the data associated with the two users being
compared, not at all users.
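
To illustrate that point, here is a minimal sketch in plain Python (not Mahout code, and not how any particular UserSimilarity implementation is written). It assumes each user's preferences are a dict of item id to rating, and it uses a Pearson correlation over co-rated items as one possible measure. The function name and the sample data are made up for the example. Note that it touches only the two users passed in.

```python
from math import sqrt


def pearson_similarity(prefs_a, prefs_b):
    """Pearson correlation over the items that both users have rated.

    prefs_a and prefs_b each map item id -> rating for a single user.
    Returns None when the correlation is undefined (fewer than two
    co-rated items, or no variance in one user's co-rated ratings).
    """
    common = sorted(set(prefs_a) & set(prefs_b))
    n = len(common)
    if n < 2:
        return None

    xs = [prefs_a[item] for item in common]
    ys = [prefs_b[item] for item in common]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


# Hypothetical data: only these two dicts are read, no other user.
alice = {"item1": 5.0, "item2": 3.0, "item3": 1.0}
bob = {"item1": 4.0, "item2": 3.0, "item3": 2.0, "item4": 5.0}
similarity = pearson_similarity(alice, bob)
```

The cost of one comparison depends on how many items the two users have rated, not on how many users exist.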

Computing a neighborhood is different. There, in theory, you do need
to compute the similarity between one user and every other user (but
still not between all pairs), and then pick some set of most-similar
users. (And there are optimizations -- for example, you could sample
10% of all other users to form a "pretty good" neighborhood rather
than actually looking at everyone else.)
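
A rough sketch of that neighborhood step, again in plain Python rather than Mahout code: compare one user against the others (or against a random sample of them) and keep the n most similar. The names nearest_neighbors, overlap_similarity and sample_rate are invented for this example, and the Jaccard overlap is only a stand-in for whatever similarity you actually use.

```python
import heapq
import random


def overlap_similarity(prefs_a, prefs_b):
    """Stand-in similarity: Jaccard overlap of the two users' rated items."""
    union = set(prefs_a) | set(prefs_b)
    if not union:
        return 0.0
    return len(set(prefs_a) & set(prefs_b)) / len(union)


def nearest_neighbors(user_id, all_prefs, n, sample_rate=1.0, rng=None):
    """Return up to n users most similar to user_id.

    all_prefs maps user id -> {item id -> rating}. With sample_rate < 1.0
    each other user is considered only with that probability, trading
    neighborhood quality for speed.
    """
    rng = rng or random.Random()
    target = all_prefs[user_id]
    candidates = [
        other for other in all_prefs
        if other != user_id
        and (sample_rate >= 1.0 or rng.random() < sample_rate)
    ]
    # One similarity computation per candidate: linear in the number of
    # users, not quadratic.
    return heapq.nlargest(
        n, candidates, key=lambda other: overlap_similarity(target, all_prefs[other])
    )


# Hypothetical data for illustration.
all_prefs = {
    "u1": {"a": 5, "b": 3, "c": 4},
    "u2": {"a": 4, "b": 2, "c": 5},
    "u3": {"a": 1, "d": 2},
    "u4": {"e": 3},
    "u5": {"a": 2, "b": 5, "e": 1},
}
neighbors = nearest_neighbors("u1", all_prefs, n=2)
sampled = nearest_neighbors("u1", all_prefs, n=2, sample_rate=0.5,
                            rng=random.Random(0))
```

With the full scan, each request costs one comparison per other user; with sampling, roughly sample_rate times that.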

You bring up clustering. Indeed, that is one approach. You start by
clustering users -- basically, making a bunch of disjoint
neighborhoods ahead of time -- and then recommend from within the
cluster. Even so, that can be done somewhat more efficiently than by
looking at all pairs. See TreeClusteringRecommender.

Yes, anything that requires looking at all pairs of users could be
disastrously slow.
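
To put rough numbers on that, here is a small calculation. It assumes, purely for illustration, that the "1 million rows" in the question means about one million users; the function names are made up.

```python
def all_pairs_count(num_users):
    """Number of distinct unordered user pairs: n * (n - 1) / 2."""
    return num_users * (num_users - 1) // 2


def one_vs_all_count(num_users):
    """Comparisons needed to build one user's neighborhood by full scan."""
    return num_users - 1


num_users = 1_000_000  # assumed, see note above
pairs = all_pairs_count(num_users)
per_neighborhood = one_vs_all_count(num_users)
print(f"all pairs:        {pairs:,}")
print(f"one neighborhood: {per_neighborhood:,}")
```

That is roughly 5 x 10^11 comparisons for all pairs, against about 10^6 for a single user's neighborhood.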

If you have a lot of users, but few items, consider using an
item-based recommender instead. This would scale better, since there
are far fewer item pairs than user pairs to compare.
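
As a sketch of why few items helps (plain Python, not Mahout's item-based recommender): with a small item set, every item-item similarity can be computed once up front. The function name item_similarities is invented here, and cosine similarity over each item's rating vector is just one possible choice of measure.

```python
from itertools import combinations
from math import sqrt


def item_similarities(user_prefs):
    """Cosine similarity for every pair of items.

    user_prefs maps user id -> {item id -> rating}. Returns a dict
    keyed by (item_a, item_b) with item_a < item_b. Users who did not
    rate an item are treated as having a rating of 0 for it.
    """
    # Invert the data: item id -> {user id -> rating}.
    by_item = {}
    for user, prefs in user_prefs.items():
        for item, rating in prefs.items():
            by_item.setdefault(item, {})[user] = rating

    sims = {}
    for a, b in combinations(sorted(by_item), 2):
        ratings_a, ratings_b = by_item[a], by_item[b]
        shared_users = ratings_a.keys() & ratings_b.keys()
        dot = sum(ratings_a[u] * ratings_b[u] for u in shared_users)
        norm = (sqrt(sum(v * v for v in ratings_a.values()))
                * sqrt(sum(v * v for v in ratings_b.values())))
        sims[(a, b)] = dot / norm if norm else 0.0
    return sims


# Hypothetical data: three users, three items.
user_prefs = {
    "u1": {"x": 1.0, "y": 1.0},
    "u2": {"x": 1.0, "y": 1.0},
    "u3": {"z": 2.0},
}
sims = item_similarities(user_prefs)
```

The number of entries depends only on the number of items: three items give three pairs, however many users there are.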

On Tue, Jul 7, 2009 at 12:36 AM, charlysf<charles.ruelle@gmail.com> wrote:
>
> Hello,
>
> I am currently working on a small database. As I understand it, when I
> need the similarity between users, it basically means computing it for
> all pairs of users.
>
> Is that right, or is there a better way?
> If that is right, how can I expect a quick computation for 1 million rows?
>
> I don't see the difference between asking for the neighborhood and
> computing the similarity for all pairs of users.
>
> I thought something like this could be interesting:
> make some clusters of users, and only compute the similarity between
> users in my cluster.
>
> Thanks
> --
> View this message in context: http://www.nabble.com/Compute-similarities-for-an-hudge-quantity-of-data-tp24364711p24364711.html
> Sent from the Mahout User List mailing list archive at Nabble.com.
>
>
