You could try dimensionality reduction (PCA or SVD) first. Even if you could compute all the
similarities in the high-dimensional space, you would probably run into the curse of
dimensionality anyway.
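
As a minimal sketch with the RDD-based MLlib API (assuming your points live in an
RDD[Vector] called rows; k = 50 components is an arbitrary choice you would tune to
your data):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // rows: RDD[Vector] holding the N high-dimensional points
    val mat = new RowMatrix(rows)

    // Top 50 principal components of the data (a d x 50 local matrix)
    val pc = mat.computePrincipalComponents(50)

    // Project every point onto the components: an N x 50 RowMatrix
    val projected = mat.multiply(pc)

You could then run your cartesian + map kNN step on projected.rows instead of the
original vectors; the pairwise pass is still O(N^2), but each similarity becomes far
cheaper to evaluate.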
> On 26 Aug 2015, at 12:35, Jaonary Rabarisoa <jaonary@gmail.com> wrote:
>
> Dear all,
>
> I'm trying to find an efficient way to build a kNN graph for a large dataset. Precisely,
> I have a large set of high-dimensional vectors (say d >> 10000) and I want to build
> a graph where those high-dimensional points are the vertices and each one is linked to its
> k nearest neighbors based on some kind of similarity defined on the vertex space.
> My problem is to implement an efficient algorithm to compute the weight matrix of the
> graph. I need to compute N*N similarities, and the only way I know is a "cartesian"
> operation followed by a "map" operation on the RDD. But this is very slow when N is large.
> Is there a cleverer way to do this for an arbitrary similarity function?
>
> Cheers,
>
> Jao

