hadoop-user mailing list archives

From dexter morgan <dextermorga...@gmail.com>
Subject best way to join?
Date Mon, 27 Aug 2012 20:15:32 GMT
Dear list,

Let's say I have a file like this:
id \t lat,lng <-- structure

1\t40.123,-50.432
2\t41.431,-43.32
...
...
Let's call it 'points.txt'.
I'm trying to build a map-reduce job that runs over this BIG points file
and outputs a file that will look like:
id[lat,lng] \t [list of points in JSON standard] <--- structure

1[40.123,-50.432]\t[[41.431,-43.32],[...,...],...,[...]]
2[41.431,-43.32]\t[[40.123,-50.432],...[,]]
...

Basically it should run against ITSELF (a self-join), and grab for each
point the N (it will be an argument to the job) CLOSEST points (the
mappers should calculate the distances)..
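To make the target output concrete, here's a rough Python sketch of the per-point computation I have in mind (not real Hadoop code, just the logic a reducer would run; the helper names, the sample point values, and the plain Euclidean distance are all made up for illustration — real geo data would want haversine):

```python
import json

def sq_dist(a, b):
    # Squared Euclidean distance between two (lat, lng) pairs.
    # Placeholder metric; swap in haversine for real coordinates.
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def n_closest(points, n):
    # points: dict of id -> (lat, lng).
    # Yields lines in the id[lat,lng] \t [JSON list] format above.
    for pid, p in sorted(points.items()):
        neighbours = sorted(
            (q for qid, q in points.items() if qid != pid),
            key=lambda q: sq_dist(p, q),
        )[:n]
        yield "%s[%s,%s]\t%s" % (
            pid, p[0], p[1],
            json.dumps([list(q) for q in neighbours]),
        )
```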

Distributed cache is not an option, so what else? I'm not sure whether to
classify it as a map-side join, a reduce-side join, or both?
Would you do this in HIVE some how?
Is it feasible in a single job?
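One single-job idea I've been toying with (no idea if it's the right way — the group count, tags, and all names below are my own invention, and it replicates every point to every reducer, which may not scale): split the points into B groups, have the mapper emit each point once as an "anchor" to its home group and once as a "candidate" to every group, then let each reducer compute the N closest for its anchors. Simulated in plain Python:

```python
import heapq
from collections import defaultdict

def map_phase(points, num_groups):
    # points: iterable of (id, (lat, lng)).
    # Each point is an anchor ("A") in its home group and a
    # candidate ("C") everywhere else, so every reducer sees
    # the full dataset once.
    shuffled = defaultdict(list)
    for pid, p in points:
        home = hash(pid) % num_groups
        for g in range(num_groups):
            shuffled[g].append(("A" if g == home else "C", pid, p))
    return shuffled

def reduce_phase(values, n):
    # For each anchor in this group, pick the n nearest other points.
    anchors = [(pid, p) for tag, pid, p in values if tag == "A"]
    candidates = [(pid, p) for tag, pid, p in values]
    out = {}
    for pid, p in anchors:
        out[pid] = heapq.nsmallest(
            n,
            (q for qid, q in candidates if qid != pid),
            key=lambda q: (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2,
        )
    return out
```

The obvious downside is the shuffle carries B copies of the data, so B would have to stay small — which is partly why I'm asking whether there's a smarter partitioning.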

Would LOVE to hear your suggestions, code (if you think it's not that
hard), or whatnot.
BTW using CDH3 - rev 3 (20.23)

Thanks!
