Github user njayaram2 commented on a diff in the pull request:
https://github.com/apache/madlib/pull/225#discussion_r161918108
--- Diff: src/ports/postgres/modules/knn/knn.sql_in ---
@@ -326,6 +331,39 @@ Result, with neighbors sorted from closest to furthest:
(6 rows)
</pre>
+
+-# Run KNN for classification using the
+weighted average:
+<pre class="example">
+DROP TABLE IF EXISTS knn_result_classification;
+SELECT * FROM madlib.knn(
+ 'knn_train_data', -- Table of training data
+ 'data', -- Col name of training data
+ 'id', -- Col name of id in train data
+ 'label', -- Training labels
+ 'knn_test_data', -- Table of test data
+ 'data', -- Col name of test data
+ 'id', -- Col name of id in test data
+ 'knn_result_classification', -- Output table
+ 3, -- Number of nearest neighbors
+ True, -- True to list nearest-neighbors by id
+ 'madlib.squared_dist_norm2', -- Distance function
+ True -- For weighted average
+ );
+SELECT * FROM knn_result_classification ORDER BY id;
+</pre>
+<pre class="result">
+ id | data | prediction | k_nearest_neighbours
+----+---------+---------------------+----------------------
+ 1 | {2,1} | 2.2 | {1,2,3}
+ 2 | {2,6} | 0.425 | {3,4,5}
+ 3 | {15,40} | 0.0174339622641509 | {5,6,7}
+ 4 | {12,1} | 0.0379633360193392 | {3,4,5}
+ 5 | {2,90} | 0.00306428140577315 | {6,7,9}
+ 6 | {50,45} | 0.00214165229166379 | {6,7,8}
+(6 rows)
+</pre>
+
--- End diff --
Running this example on Greenplum 5 fails with the following error:
```
greenplum=# DROP TABLE IF EXISTS knn_result_classification;
NOTICE: table "knn_result_classification" does not exist, skipping
DROP TABLE
greenplum=# SELECT * FROM madlib.knn(
greenplum(# 'knn_train_data', -- Table of training data
greenplum(# 'data', -- Col name of training data
greenplum(# 'id', -- Col name of id in train data
greenplum(# 'label', -- Training labels
greenplum(# 'knn_test_data', -- Table of test data
greenplum(# 'data', -- Col name of test data
greenplum(# 'id', -- Col name of id in test data
greenplum(# 'knn_result_classification', -- Output table
greenplum(# 3, -- Number of nearest neighbors
greenplum(# True, -- True to list nearest-neighbors by id
greenplum(# 'madlib.squared_dist_norm2', -- Distance function
greenplum(# True -- For weighted average
greenplum(# );
ERROR: plpy.SPIError: function expression in FROM cannot refer to other relations of same query level
LINE 15: a , unnest(k_nearest_neighbours)...
^
QUERY:
CREATE TABLE knn_result_classification AS
SELECT id, data ,max(prediction) as prediction
, array_agg(distinct k_neighbours) AS k_nearest_neighbours
FROM
( SELECT __madlib_temp_test_id_temp29900589_1516144312_53639332__
AS id, data
,sum(1/dist) AS prediction
, array_agg(knn_temp.train_id ORDER BY knn_temp.dist ASC)
AS k_nearest_neighbours
FROM pg_temp.__madlib_temp_interim_table75130626_1516144312_10216040__
AS knn_temp
JOIN
knn_test_data AS knn_test ON
knn_temp.__madlib_temp_test_id_temp29900589_1516144312_53639332__
= knn_test.id
GROUP BY __madlib_temp_test_id_temp29900589_1516144312_53639332__
,
data, __madlib_temp_label_col_temp66682446_1516144312_5242078__)
a , unnest(k_nearest_neighbours) as k_neighbours
GROUP BY id, data
CONTEXT: Traceback (most recent call last):
PL/Python function "knn", line 36, in <module>
weighted_avg
PL/Python function "knn", line 242, in knn
PL/Python function "knn"
```
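This is probably because the generated query relies on behavior that is available in
Postgres 9.x but not in Greenplum 5: `unnest(k_nearest_neighbours)` appears in the FROM
clause and refers to a column of the subquery `a` at the same query level, which is what
the error rejects. We should use constructs that work on both platforms.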
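
One possible way to avoid the FROM-clause reference is to call `unnest()` in the SELECT list of an inner subquery and aggregate one level up. The following is only a rough sketch of that pattern, not the actual fix for `knn.py_in`: the table `knn_agg_demo` and its rows are hypothetical stand-ins for the inner aggregate subquery `a`, the `data` column is left out for brevity, and I have not verified it on every supported platform.

```sql
-- Hypothetical stand-in for the inner aggregate subquery "a" in the failing query.
DROP TABLE IF EXISTS knn_agg_demo;
CREATE TEMP TABLE knn_agg_demo (
    id                   integer,
    prediction           double precision,
    k_nearest_neighbours integer[]
);
INSERT INTO knn_agg_demo VALUES (1, 2.2,   ARRAY[1,2,3]);
INSERT INTO knn_agg_demo VALUES (2, 0.425, ARRAY[3,4,5]);

-- Form used by the generated query: the function in FROM refers to a column
-- of a sibling FROM item, which is what the Greenplum 5 error complains about.
-- SELECT id, k_neighbours
-- FROM knn_agg_demo a, unnest(k_nearest_neighbours) AS k_neighbours;

-- Sketch of an alternative: expand the array in the SELECT list of a subquery,
-- then aggregate in the outer query, so nothing in FROM refers to a sibling.
SELECT id,
       max(prediction)                  AS prediction,
       array_agg(DISTINCT k_neighbours) AS k_nearest_neighbours
FROM (
    SELECT id,
           prediction,
           unnest(k_nearest_neighbours) AS k_neighbours
    FROM knn_agg_demo
) b
GROUP BY id
ORDER BY id;
```

If this pattern holds up on both Postgres and Greenplum, the same restructuring could be applied to the query that `knn` generates when the weighted average option is set.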
---