spark-user mailing list archives

From Andrew Ash <>
Subject Re: Huge matrix
Date Fri, 11 Apr 2014 22:24:03 GMT
The naive way would be to put all the users and their attributes into an
RDD, then take the cartesian product of that RDD with itself.  Run the
similarity score on every pair (1M * 1M => 1T scores), map to
(user, (score, otherUser)), and keep the top k by score for each user.
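
To make the data flow concrete, here is a minimal sketch of that naive approach in plain Python rather than Spark, on four toy users. The `similarity` function (cosine similarity), the `top_k_similar` helper, and the sample data are all assumptions for illustration; the thread does not say which score is used. In PySpark the same steps would roughly correspond to `rdd.cartesian(rdd)`, a `map`, and a per-key top-k such as `aggregateByKey` with a bounded heap.

```python
import heapq
from itertools import product


def similarity(a, b):
    """Hypothetical score: cosine similarity of two attribute vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_similar(users, k):
    """users: dict of user id -> attribute vector.

    Returns a dict of user id -> list of (score, other_id), best first.
    """
    heaps = {uid: [] for uid in users}
    # Cartesian product of the user set with itself: n * n pairs.
    for (u, attrs_u), (v, attrs_v) in product(users.items(), repeat=2):
        if u == v:
            continue  # skip self-pairs
        entry = (similarity(attrs_u, attrs_v), v)
        heap = heaps[u]
        # Keep only the k best entries per user in a min-heap,
        # so memory per user stays O(k) instead of O(n).
        if len(heap) < k:
            heapq.heappush(heap, entry)
        else:
            heapq.heappushpop(heap, entry)
    return {u: sorted(h, reverse=True) for u, h in heaps.items()}


# Toy data (assumed): four users with two attributes each.
users = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.5, 0.5]}
result = top_k_similar(users, k=2)
```

The bounded heap matters even in this sketch: it avoids materializing every (score, otherUser) pair per user before selecting the top k, though the number of similarity evaluations is still quadratic.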

I doubt that you'll be able to take this approach with the 1T pairs though,
so it might be worth looking at the literature for recommender systems to
see what else is out there.
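
A quick back-of-the-envelope estimate shows why 1T pairs is a problem. The per-score size and the scoring throughput below are assumed round numbers, not measurements; only the user count comes from the thread.

```python
n_users = 1_000_000
n_pairs = n_users * n_users  # 10**12 pairs, i.e. 1T scores

# Assumptions for illustration only:
bytes_per_score = 8                     # one 64-bit float per score
scores_per_second_per_core = 1_000_000  # optimistic scoring throughput

# Storage if every score were materialized, in terabytes (10**12 bytes).
storage_tb = n_pairs * bytes_per_score / 1e12

# Total compute time expressed in single-core days.
core_days = n_pairs / scores_per_second_per_core / 86_400

print(f"{n_pairs:.0e} pairs, {storage_tb:.0f} TB of scores, "
      f"{core_days:.1f} core-days of scoring")
```

Under these assumptions the scores alone take about 8 TB and roughly 11.6 core-days to compute, before any shuffle or per-user selection cost.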

On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li <> wrote:

> Hi all,
> I am implementing an algorithm using Spark. I have one million users. I
> need to compute the similarity between each pair of users using some of the
> users' attributes.  For each user, I need to get the top k most similar
> users. What is the best way to implement this?
> Thanks.
