spark-user mailing list archives

From Robin East <robin.e...@xense.co.uk>
Subject Re: Build k-NN graph for large dataset
Date Wed, 26 Aug 2015 11:51:08 GMT
You could try dimensionality reduction (PCA or SVD) first. I would imagine that even if you
could successfully compute similarities in the high-dimensional space, you would probably run
into the curse of dimensionality.
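
A minimal sketch of that idea with the RDD-based MLlib API (the target dimension k and the
input RDD `vectors` are placeholders, not from the thread):

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.rdd.RDD

    // vectors: RDD[Vector] holding the original high-dimensional points (assumed input).
    def reduceDimensions(vectors: RDD[Vector], k: Int): RowMatrix = {
      val mat = new RowMatrix(vectors)
      // Top-k principal components, returned as a local d x k matrix;
      // for very wide data, mat.computeSVD(k, computeU = true) may scale better.
      val pc = mat.computePrincipalComponents(k)
      // Project every row into the k-dimensional subspace.
      mat.multiply(pc)
    }

The k-NN search then runs on the projected rows rather than the original d-dimensional points.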
> On 26 Aug 2015, at 12:35, Jaonary Rabarisoa <jaonary@gmail.com> wrote:
> 
> Dear all,
> 
> I'm trying to find an efficient way to build a k-NN graph for a large dataset. Precisely,
> I have a large set of high-dimensional vectors (say d >> 10000) and I want to build
> a graph where those high-dimensional points are the vertices, each one linked to its
> k nearest neighbors under some similarity defined on the vector space.
> My problem is implementing an efficient algorithm to compute the weight matrix of the
> graph. I need to compute N*N similarities, and the only way I know is a "cartesian"
> operation followed by a "map" operation on the RDD. But this is very slow when N is large.
> Is there a cleverer way to do this for an arbitrary similarity function?
> 
> Cheers,
> 
> Jao
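
For reference, the brute-force pattern described above (cartesian followed by map) looks
roughly like this. This is a sketch only: the `similarity` function and the (id, vector)
layout are assumptions, and the O(N^2) shuffle it produces is exactly the bottleneck in
question:

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // points: RDD of (id, vector) pairs; similarity is any user-supplied function.
    def knnGraph(points: RDD[(Long, Vector)],
                 similarity: (Vector, Vector) => Double,
                 k: Int): RDD[(Long, Array[(Long, Double)])] = {
      points.cartesian(points)                                // all N*N pairs
        .filter { case ((i, _), (j, _)) => i != j }           // drop self-pairs
        .map { case ((i, u), (j, v)) => (i, (j, similarity(u, v))) }
        .groupByKey()                                         // N-1 candidates per vertex
        .mapValues(_.toSeq.sortBy { case (_, sim) => -sim }   // keep the k most similar
          .take(k).toArray)
    }

For cosine similarity specifically, RowMatrix.columnSimilarities (DIMSUM sampling, available
since Spark 1.2) avoids materialising all pairs, though it compares columns rather than rows
and fixes the similarity function; a truly arbitrary similarity generally needs some
candidate-pruning scheme (e.g. LSH-style bucketing) to escape the N*N cost.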



