hadoop-user mailing list archives

From dexter morgan <dextermorga...@gmail.com>
Subject best way to join?
Date Mon, 27 Aug 2012 20:15:32 GMT
Dear list,

Let's say I have a file, like this:
id \t lat,lng <-- structure

Let's call it 'points.txt'.
I'm trying to build a map-reduce job that runs over this BIG points file
and outputs a file that looks like this:
id[lat,lng] \t [list of points in JSON standard] <--- structure


Basically it should run against ITSELF and, for each point, grab the N
CLOSEST points (N will be an argument for the job; the mappers should
calculate the distance).
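
To make the requirement concrete, here is a rough single-machine sketch in
Python of the result I'm after. It is not a MapReduce job. The haversine
distance, the JSON field names ("id", "lat", "lng") and all the function names
are just assumptions for illustration. It compares every point with every
other point, which is exactly the part that won't scale to the BIG file on
one machine.

```python
import heapq
import json
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lng) pairs in degrees.

    Assumption: the post does not say which distance to use.
    """
    lat1, lng1, lat2, lng2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def parse_line(line):
    """Parse one 'id \\t lat,lng' record into (id, (lat, lng))."""
    point_id, coords = line.rstrip("\n").split("\t")
    lat, lng = (float(x) for x in coords.split(","))
    return point_id.strip(), (lat, lng)


def n_closest(points, n):
    """Yield one 'id[lat,lng] \\t JSON list' line per point.

    The JSON list holds the n nearest *other* points, closest first.
    """
    for pid, loc in points:
        # Distance from this point to every other point (brute force).
        others = ((haversine_km(loc, o_loc), o_id, o_loc)
                  for o_id, o_loc in points if o_id != pid)
        nearest = heapq.nsmallest(n, others)
        payload = [{"id": o_id, "lat": o_loc[0], "lng": o_loc[1]}
                   for _, o_id, o_loc in nearest]
        yield "%s[%s,%s]\t%s" % (pid, loc[0], loc[1], json.dumps(payload))
```

With N = 1 and three points a, b and c on the equator at longitudes 0, 1 and 3,
both a and c should list b as their closest point.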

Distributed cache is not an option, so what else is there? I'm not sure whether
to classify it as a map-join, a reduce-join or both.
Would you do this in Hive somehow?
Is it feasible in a single job?

Would LOVE to hear your suggestions, code (if you think it's not that hard)
or whatnot.
BTW, I'm using CDH3 - rev 3 (20.23)

