# hive-user mailing list archives

##### Site index · List index
Message view
Top
From "Markovitz, Dudu" <dmarkov...@paypal.com>
Subject RE: What's the best way to find the nearest neighbor in Hive? Any windowing function?
Date Wed, 14 Sep 2016 17:54:27 GMT
```It seems youâ€™ll have to go with JOIN.
Here are 2 options.

Dudu

select      t0.id
as id_0
,min (named_struct ("dist",abs((t1.price - t0.price)/100) + abs((t1.number - t0.number)/1000),"id",t1.id)).id
as id_1

from                t   as t0

join    t   as t1

on      t0.state    =
t1.state

and t0.city    =
t1.city

where       t0.flag = 0
and t1.flag = 1

group by    t0.id
;

select      t.id_0
,t.id_1

from       (select      t0.id
as id_0
,t1.id
as id_1
,row_number () over (partition by t0.id order by abs((t1.price - t0.price)/100)
+ abs((t1.number - t0.number)/1000)) as n

from                t   as t0

join    t   as t1

on      t0.state    =
t1.state

and t0.city    =
t1.city

where       t0.flag = 0
and t1.flag = 1
)
as t

where       n = 1
;

From: Mobius ReX [mailto:aoirex@gmail.com]
Sent: Tuesday, September 13, 2016 10:47 PM
To: user@hive.apache.org
Subject: What's the best way to find the nearest neighbor in Hive? Any windowing function?

Given a table

>     \$cat data.csv
>
>     ID,State,City,Price,Number,Flag
>     1,CA,A,100,1000,0
>     2,CA,A,96,1010,1
>     3,CA,A,195,1010,1
>     4,NY,B,124,2000,0
>     5,NY,B,128,2001,1
>     6,NY,C,24,30000,0
>     7,NY,C,27,30100,1
>     8,NY,C,29,30200,0
>     9,NY,C,39,33000,1

Expected Result:

ID0, ID1
1,2
4,5
6,7
8,7

for each ID with Flag=0 above, we want to find another ID from Flag=1, with the same "State"
and "City", and the nearest Price and Number normalized by the corresponding values of that
ID with Flag=0.

For example, ID = 1 and ID=2, has the same State and City, but different FLAG.
After normalized the Price and Number (Price divided by 100, Number divided by 1000), the
distance between ID=1 and ID=2 is:
abs(100/100 - 96/100) + abs(1000/1000 - 1010/1000) = 0.04 + 0.01 = 0.05

What's the best way to find such nearest neighbor in Hive? Can we use Lead/Lag or Rank for
this case? Any valuable tips will be greatly appreciated!
```
Mime
View raw message