hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Rex X <dnsr...@gmail.com>
Subject How to do such a join of Pairing in Hive?
Date Fri, 26 Aug 2016 00:56:28 GMT
1. Firstly we create a hive table by loading following csv file

>     $cat data.csv
>
>     ID,City,Zip,Flag
>     1,A,95126,0
>     2,A,95126,1
>     3,A,95126,1
>     4,B,95124,0
>     5,B,95124,1
>     6,C,95124,0
>     7,C,95127,1
>     8,C,95127,0
>     9,C,95127,1


(a) where "ID" above is a primary key (unique),

(b) for each "City" and "Zip" combination, there is one ID in max with
Flag=0; while it can contain multiple IDs with Flag=1 for each "City" and
"Zip" combination.

(c) Flag can be 0 or 1


2. Load the csv file above to a hive table

>     create table test(ID string, City String, Zip String, Flag int)
>     ROW FORMAT DELIMITED FIELDS TERMINATED BY "," ;
>
>     LOAD DATA LOCAL INPATH "./data.csv" OVERWRITE INTO TABLE test;


3.
For each ID with Flag=0, we want to pair it with another ID with Flag=1 but
with the same City - Zip. If one cannot find another paired ID with Flag=1
and matched City - Zip, we just delete that record.

Here is the Expected result:

>     ID,City,Zip,Flag
>     1,A,95126,0
>     2,A,95126,1
>     4,B,95124,0
>     5,B,95124,1
>     7,C,95127,1
>     8,C,95127,0


Any valuable tips how to do this pairing in Hive or Python?

Great thanks!

Rex

Mime
View raw message