I'm doing this here for multiple tens of millions of elements (and the goal is to reach multiple billions), on a relatively small cluster (7 nodes 4 cores 32GB RAM). We use multiprobe KLSH. All you have to do is run a Kmeans on your data, then compute the distance between each element with each cluster center, keep a few clusters and only look into these clusters for nearest neighbours.

This method is known to perform very well and vastly speedup your computation

The hardest part is to decide how many clusters to compute, and how many to keep. As a rule of thumb, I generally want 300-10000 elements per cluster, and use 5-20 clusters.


I am implementing an algorithm using Spark. I have one million users. I need to compute the similarity between each pair of users using some user's attributes.  For each user, I need to get top k most similar users. What is the best way to implement this?  


Guillaume PITEL, Président
+33(0)6 25 48 86 80 / +33(0)9 70 44 67 53

eXenSa S.A.S.
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05