hadoop-mapreduce-user mailing list archives

From Rex X <dnsr...@gmail.com>
Subject What's the best way to do Outer join and Inner join of two SequentialTextFiles using Hadoop streaming and Python ?
Date Fri, 22 Jan 2016 17:50:43 GMT
The two SequentialTextFiles correspond to two Hive tables, say tableA and
tableB, stored at

    hdfs://hive/tableA/YYYY/MM/DD/*/part-00000
and
    hdfs://hive/tableB/YYYY/MM/DD/*/part-00000

Both of them are partitioned by date, for example,

    hdfs://hive/tableA/2016/01/01/*/part-00000

Now we want to do a left outer join on tableA.id=tableB.id, for a date
range, for example, from 2015/12/01 to 2016/01/09.
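
Outside Hive, the date filter could be handled by selecting input paths rather than by filtering rows. A minimal sketch, assuming the directory layout shown above; the helper name `daily_input_globs` and its `root` parameter are hypothetical:

```python
from datetime import date, timedelta


def daily_input_globs(table, start, end, root="hdfs://hive"):
    """Return one glob per day in [start, end], following the layout above."""
    days = (end - start).days + 1
    return [
        "{}/{}/{:%Y/%m/%d}/*/part-00000".format(
            root, table, start + timedelta(days=i)
        )
        for i in range(days)
    ]


# One glob per day of the example range, for one table.
paths = daily_input_globs("tableA", date(2015, 12, 1), date(2016, 1, 9))
```

The resulting globs for both tables would then be given to the streaming job through its `-input` option.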

Within Hive this is pretty easy:

    select * from tableA a left outer join tableB b
    on a.id=b.id
    where a.dt between '20151201' and '20160109'
    and b.dt between '20151201' and '20160109';


What's the best way to do an outer join and an inner join of these two
SequentialTextFiles using Hadoop streaming and Python?
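
One common shape for this is a reduce-side join: the mapper tags each record with its source table, and the reducer pairs the records that share an id. The sketch below is only an illustration under several assumptions: records are text lines with the id in the first tab-separated field (Hive's default field delimiter is different), the literal string `NULL` stands in for a missing tableB row, and all function names are hypothetical. In a real streaming job the tag could be derived from the input file path, which streaming is expected to expose to the mapper in the `mapreduce_map_input_file` environment variable.

```python
from itertools import groupby

SEP = "\t"  # assumption: tab-separated fields, id in the first column


def tag_for(path):
    """Pick a source tag from the input path (assumes the layout above)."""
    return "A" if "/tableA/" in path else "B"


def map_lines(lines, tag):
    """Mapper: emit id, source tag, and the rest of the record."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        key, _, rest = line.partition(SEP)
        yield SEP.join((key, tag, rest))


def reduce_lines(lines, how="left"):
    """Reducer: input must be sorted by id; how is 'left' or 'inner'."""
    rows = (line.rstrip("\n").split(SEP, 2) for line in lines)
    for key, group in groupby(rows, key=lambda r: r[0]):
        a_rows, b_rows = [], []
        # Tag order within one id is not assumed, so both sides are buffered.
        for _, tag, rest in group:
            (a_rows if tag == "A" else b_rows).append(rest)
        if not b_rows and how == "left":
            b_rows = ["NULL"]  # placeholder for the missing tableB side
        for a in a_rows:
            for b in b_rows:
                yield SEP.join((key, a, b))


# Local demo: sorted() stands in for the shuffle between map and reduce.
table_a = ["1\tapple", "2\tbanana"]
table_b = ["1\tred"]
shuffled = sorted(list(map_lines(table_a, "A")) + list(map_lines(table_b, "B")))
left_join = list(reduce_lines(shuffled, how="left"))
inner_join = list(reduce_lines(shuffled, how="inner"))
```

In a streaming job, `map_lines` and `reduce_lines` would each be wrapped in a small script that reads `sys.stdin` and prints the yielded records. The reducer buffers every row for one id in memory, so heavily skewed ids would need a different strategy.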

Any comments will be appreciated!
