spark-user mailing list archives

From Robin East <>
Subject Re: Build k-NN graph for large dataset
Date Wed, 26 Aug 2015 11:51:08 GMT
You could try dimensionality reduction (PCA or SVD) first. I would imagine that even if you
could successfully compute all the similarities in the high-dimensional space, you would
probably still run into the curse of dimensionality.
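The suggestion above could be sketched outside Spark as follows: center the data and project onto the top principal components via SVD before doing any neighbor search. This is an illustrative sketch with made-up sizes, not Spark/MLlib code (MLlib has its own SVD/PCA on RowMatrix).

```python
import numpy as np

# Hypothetical data: 200 points in d=500 dimensions (sizes made up for the example).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 500))

# PCA via SVD: center the data, then project onto the top-k right singular vectors.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 20
X_reduced = Xc @ Vt[:k].T   # 200 points, now in k=20 dimensions

print(X_reduced.shape)
```

Similarities would then be computed on `X_reduced` instead of the original high-dimensional vectors, which both cuts the per-pair cost and mitigates the distance-concentration problem.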
> On 26 Aug 2015, at 12:35, Jaonary Rabarisoa <> wrote:
> Dear all,
> I'm trying to find an efficient way to build a k-NN graph for a large dataset. Precisely,
I have a large set of high-dimensional vectors (say d >> 10000) and I want to build
a graph where those high-dimensional points are the vertices and each one is linked to its
k nearest neighbors based on some similarity defined on the vertex space.
> My problem is to implement an efficient algorithm to compute the weight matrix of the
graph. I need to compute N*N similarities, and the only way I know is to use a "cartesian"
operation followed by a "map" operation on RDDs. But this is very slow when N is large. Is
there a cleverer way to do this for an arbitrary similarity function?
> Cheers,
> Jao
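For reference, the cartesian-then-map pattern described in the question amounts to scoring all N*(N-1) ordered pairs and keeping the top k per point. A plain-Python analogue (hypothetical toy data, not RDD code) makes the O(N^2) cost explicit:

```python
import heapq

def knn_brute_force(points, k, similarity):
    """For each point index, return the indices of its k most similar points.

    This scores every ordered pair, which is the same O(N^2) work the
    RDD cartesian + map approach performs in a distributed setting.
    """
    n = len(points)
    neighbors = {}
    for i in range(n):
        scored = ((similarity(points[i], points[j]), j)
                  for j in range(n) if j != i)
        neighbors[i] = [j for _, j in heapq.nlargest(k, scored)]
    return neighbors

# Toy example: 1-D points with negative squared distance as the similarity.
pts = [0.0, 0.1, 0.3, 10.0]
sim = lambda a, b: -(a - b) ** 2
print(knn_brute_force(pts, 2, sim))
```

Because the pairing is quadratic regardless of how it is distributed, the usual escape for arbitrary similarity functions is to prune the candidate pairs (e.g. via dimensionality reduction as suggested in the reply, or approximate techniques such as locality-sensitive hashing) rather than to speed up the full cartesian product.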
