hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mobius ReX <aoi...@gmail.com>
Subject What's the best way to find the nearest neighbor in Hive? Any windowing function?
Date Tue, 13 Sep 2016 19:46:54 GMT
Given a table

>     $cat data.csv
>
>     ID,State,City,Price,Number,Flag
>     1,CA,A,100,1000,0
>     2,CA,A,96,1010,1
>     3,CA,A,195,1010,1
>     4,NY,B,124,2000,0
>     5,NY,B,128,2001,1
>     6,NY,C,24,30000,0
>     7,NY,C,27,30100,1
>     8,NY,C,29,30200,0
>     9,NY,C,39,33000,1


Expected Result:

    ID0, ID1
    1,2
    4,5
    6,7
    8,7

for each ID with Flag=0 above, we want to find another ID from Flag=1, with
the same "State" and "City", and the nearest Price and Number normalized by
the corresponding values of that ID with Flag=0.

For example, ID = 1 and ID=2, has the same State and City, but different
FLAG.
After normalized the Price and Number (Price divided by 100, Number divided
by 1000), the distance between ID=1 and ID=2 is:
abs(100/100 - 96/100) + abs(1000/1000 - 1010/1000) = 0.04 + 0.01 = 0.05


What's the best way to find such nearest neighbor in Hive? Can we use
Lead/Lag or Rank for this case? Any valuable tips will be greatly
appreciated!

Mime
View raw message