Date: Wed, 6 Apr 2016 03:08:25 +0000 (UTC)
From: "Daniel Blazevski (JIRA)"
To: issues@flink.apache.org
Reply-To: dev@flink.apache.org
Subject: [jira] [Comment Edited] (FLINK-1934) Add approximative k-nearest-neighbours (kNN) algorithm to machine learning library

    [ https://issues.apache.org/jira/browse/FLINK-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227632#comment-15227632 ]

Daniel Blazevski edited comment on FLINK-1934 at 4/6/16 3:07 AM:
-----------------------------------------------------------------

[~chiwanpark] [~till.rohrmann]

I have a Flink version -- still a bit preliminary -- of the approximate kNN up and running. The exact kNN using a quadtree performs quite badly in moderate-to-high spatial dimensions (e.g., with 20,000 test and 20,000 training points in 6D, the quadtree is slower than brute force), but no worries: I took care of this, and the exact kNN now decides whether or not to use the quadtree.

https://github.com/danielblazevski/flink/blob/FLINK-1934/flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/nn/zknn.scala

A preliminary test shows good scaling as the number of test and training points is increased.

8,000 points in 6D (i.e., 8,000 test points and 8,000 training points):
Elapsed time approx = 2s
Elapsed time exact = 27s

64,000 points in 6D:
Elapsed time approx = 6s
(I didn't run the exact version; we know it's O(N^2).)

I will have to clean things up, add edge cases, etc., which may slow down the run-time a bit, but will definitely not increase the complexity of the algorithm with respect to the number of test/training points. This still uses a cross product, which I was hoping to avoid, but I'm not sure that's possible. Any thoughts?

Basically, the idea is to hash the test/train set to 1D (I use the z-value hash based on [1]). I still have not implemented the ideas in [1] in full; the full solution is quite complex. They do a bunch of load balancing that I'm still working through, and I'm not quite sure of the payoff. One option could be that I clean up and optimize what I have now, since it's already performing well, and we open a new issue to implement all the steps in [1].
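To make the hashing idea concrete, here is a minimal sketch of the kind of z-value (Morton order) computation I mean -- this is not the actual code in zknn.scala, and the names (ZValueSketch, zValue, nBits, mins, maxs) are just illustrative. It normalizes each coordinate, quantizes it to a fixed number of bits, and interleaves the bits of all dimensions into a single 1-D key, which is also the fix for the integer-truncation collisions I mention below.

{code:scala}
// Illustrative sketch only -- not the zknn.scala implementation.
object ZValueSketch {

  /** Z-value (Morton) key for one point: normalize each coordinate to [0, 1],
    * quantize it to nBits bits, and interleave the bits of all dimensions.
    * With d dimensions and nBits bits per coordinate we need d * nBits <= 63. */
  def zValue(point: Array[Double], mins: Array[Double], maxs: Array[Double], nBits: Int): Long = {
    val d = point.length
    require(d * nBits <= 63, "z-value must fit in a Long")
    val maxQ = (1L << nBits) - 1

    // Quantize: map each coordinate to an integer in [0, 2^nBits - 1],
    // clamping to guard against rounding at the boundaries.
    val quantized = Array.tabulate(d) { i =>
      val scaled = (point(i) - mins(i)) / (maxs(i) - mins(i))
      math.max(0L, math.min((scaled * maxQ).toLong, maxQ))
    }

    // Interleave: bit b of dimension i goes to position b * d + i of the key.
    var key = 0L
    var b = 0
    while (b < nBits) {
      var i = 0
      while (i < d) {
        key |= ((quantized(i) >> b) & 1L) << (b * d + i)
        i += 1
      }
      b += 1
    }
    key
  }
}
{code}

Sorting points by this key and comparing each test point only against nearby training points in the sorted order is what avoids most of the O(N^2) work; if I read [1] correctly, the full scheme there also uses randomly shifted copies of the data to tighten the approximation.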
There are still many things to clean up, but any cleaning/edge cases will not add computational complexity with respect to the number of test points. E.g., I currently convert the coordinates to integers and ignore the decimal part, which causes lots of collisions in the z-value hash; normalizing the data and using a fixed maximum number of bits to compute the z-value is needed, but will definitely not increase the complexity as more test/training points are added (this is described towards the end of [3]).

Any thoughts?

> Add approximative k-nearest-neighbours (kNN) algorithm to machine learning library
> -----------------------------------------------------------------------------------
>
>                 Key: FLINK-1934
>                 URL: https://issues.apache.org/jira/browse/FLINK-1934
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Assignee: Daniel Blazevski
>              Labels: ML
>
> kNN is still a widely used algorithm for classification and regression. However, due to the computational costs of an exact implementation, it does not scale well to large amounts of data. Therefore, it is worthwhile to also add an approximative kNN implementation as proposed in [1, 2]. Reference [3] is cited a few times in [1] and gives the necessary background on the z-value approach.
>
> Resources:
> [1] https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf
> [2] http://www.computer.org/csdl/proceedings/wacv/2007/2794/00/27940028.pdf
> [3] http://cs.sjtu.edu.cn/~yaobin/papers/icde10_knn.pdf

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)