spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Guillaume Pitel <guillaume.pi...@exensa.com>
Subject Re: Huge matrix
Date Sat, 12 Apr 2014 14:46:25 GMT
Hi,

I'm doing this here for multiple tens of millions of elements (and the goal is
to reach multiple billions), on a relatively small cluster (7 nodes 4 cores 32GB
RAM). We use multiprobe KLSH. All you have to do is run a Kmeans on your data,
then compute the distance between each element with each cluster center, keep a
few clusters and only look into these clusters for nearest neighbours.

This method is known to perform very well and vastly speedup your computation

The hardest part is to decide how many clusters to compute, and how many to
keep. As a rule of thumb, I generally want 300-10000 elements per cluster, and
use 5-20 clusters.

Guillaume 
>
> I am implementing an algorithm using Spark. I have one million users. I need
> to compute the similarity between each pair of users using some user's
> attributes.  For each user, I need to get top k most similar users. What is
> the best way to implement this?  
>
>
> Thanks.


-- 
eXenSa

	
*Guillaume PITEL, Président*
+33(0)6 25 48 86 80 / +33(0)9 70 44 67 53

eXenSa S.A.S. <http://www.exensa.com/>
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05


Mime
View raw message