flink-issues mailing list archives

From "Daniel Blazevski (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (FLINK-1934) Add approximative k-nearest-neighbours (kNN) algorithm to machine learning library
Date Wed, 06 Apr 2016 03:08:25 GMT

[ https://issues.apache.org/jira/browse/FLINK-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227632#comment-15227632 ]

Daniel Blazevski edited comment on FLINK-1934 at 4/6/16 3:07 AM:
-----------------------------------------------------------------

[~chiwanpark] [~till.rohrmann]

I have a Flink version -- still a bit preliminary -- of the approximate kNN up and running.
The exact kNN using a quadtree performs quite badly in moderate-to-high spatial dimensions
(e.g. with 20,000 test and training points in 6D, the quadtree is slower; but no worries, I
took care of this, and the exact version now decides whether or not to use the quadtree).
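The "decide whether to use the quadtree" step could be sketched roughly as below. This is an illustrative assumption, not the actual code: the object name `QuadTreeChoice`, the dimension cutoff, and the size threshold are all hypothetical.

```scala
object QuadTreeChoice {
  // Hypothetical heuristic: a quadtree splits each node into 2^dim children,
  // so traversal cost grows exponentially with dimension. Only use it when
  // the dimension is small and the training set is large enough to pay off.
  def useQuadTree(dim: Int, trainingSize: Long, k: Int): Boolean =
    dim <= 4 && trainingSize > k.toLong * (1L << dim)
}
```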

https://github.com/danielblazevski/flink/blob/FLINK-1934/flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/nn/zknn.scala

A preliminary test shows good scaling as the number of test and training points is increased.

8,000 points in 6D (i.e. 8,000 test points and 8,000 training points):
Elapsed time, approximate: 2s
Elapsed time, exact: 27s

64,000 points in 6D:
Elapsed time, approximate: 6s
(didn't run the exact version; we know it's O(N^2))

I will have to clean things up, add edge cases, etc., which may slow down the run-time a bit,
but will definitely not increase the complexity of the algorithm with respect to the number
of test/training points.

This still uses a cross product, which I was hoping to avoid, but I'm not sure that's possible.
Any thoughts? Basically, the idea is to hash the test/train set to 1D (I use the z-value
hash based on [1]).
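The z-value hash maps a point to 1D by interleaving the bits of its integer coordinates, so that points close in space tend to land close together on the resulting curve. A minimal sketch (the object/method names are illustrative, not from the branch):

```scala
object ZValue {
  // Interleave the low `bits` bits of each non-negative integer coordinate
  // into a single Long (Morton / z-order value). Bit b of coordinate d ends
  // up at position b * dim + d in the result.
  def zValue(coords: Array[Int], bits: Int = 16): Long = {
    val dim = coords.length
    var result = 0L
    var b = 0
    while (b < bits) {
      var d = 0
      while (d < dim) {
        val bit = (coords(d) >> b) & 1
        result |= bit.toLong << (b * dim + d)
        d += 1
      }
      b += 1
    }
    result
  }
}
```

For example, in 2D the point (3, 3) with 2 bits per coordinate interleaves to binary 1111, i.e. 15.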

I still have not implemented the ideas in [1] in full. The full solution is quite complex.
They do a bunch of load balancing that I'm still learning, and I'm not quite sure of the payoff.
One option could be that I clean up and optimize what I have now, since it's already performing
well, and we open a new issue to implement all the steps in [1].

There are still many things to clean up, but any cleaning/edge cases will not add computational
complexity with respect to the number of test points. E.g., I currently convert the coordinates
to integers and ignore the decimal part, so there are lots of collisions in the z-value hash.
Normalizing the data and using a fixed maximum number of bits to compute the z-value is
needed, but will definitely not increase the complexity with respect to adding more test/training
points (this is described towards the end of [3]).
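The normalization step could look something like the sketch below: map each coordinate into [0, 1] using the data's min/max, then quantize to a fixed number of bits, so distinct points no longer collapse onto the same integer (and hence the same z-value). The `Quantize` object and its signature are assumptions for illustration:

```scala
object Quantize {
  // Normalize x from [min, max] into [0, 1], then quantize to `bits` bits,
  // i.e. an integer in [0, 2^bits - 1]. Values outside [min, max] are clamped.
  def quantize(x: Double, min: Double, max: Double, bits: Int): Int = {
    val levels = (1 << bits) - 1
    val t = ((x - min) / (max - min)).max(0.0).min(1.0)
    math.round(t * levels).toInt
  }
}
```

With, say, 8 bits, the decimal part is no longer discarded: 0.5 in [0, 1] maps to 128 rather than truncating to 0.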

Any thoughts?


> Add approximative k-nearest-neighbours (kNN) algorithm to machine learning library
> ----------------------------------------------------------------------------------
>
>                 Key: FLINK-1934
>                 URL: https://issues.apache.org/jira/browse/FLINK-1934
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Assignee: Daniel Blazevski
>              Labels: ML
>
> kNN is still a widely used algorithm for classification and regression. However, due
to the computational costs of an exact implementation, it does not scale well to large amounts
of data. Therefore, it is worthwhile to also add an approximative kNN implementation as proposed
in [1,2].  Reference [3] is cited a few times in [1], and gives necessary background on the
z-value approach.
> Resources:
> [1] https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf
> [2] http://www.computer.org/csdl/proceedings/wacv/2007/2794/00/27940028.pdf
> [3] http://cs.sjtu.edu.cn/~yaobin/papers/icde10_knn.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
