hadoop-user mailing list archives

From dexter morgan <dextermorga...@gmail.com>
Subject Re: best way to join?
Date Mon, 27 Aug 2012 20:24:21 GMT
Dear list,
>
> Let's say I have a file like this:
> id \t lat,lng <-- structure
>
> 1\t40.123,-50.432
> 2\t41.431,-43.32
> ...
> ...
> Let's call it 'points.txt'.
> I'm trying to build a map-reduce job that runs over this BIG points file
> and outputs a file that will look like:
> id[lat,lng] \t [list of points in JSON standard] <--- structure
>
> 1[40.123,-50.432]\t[[41.431,-43.32],[...,...],...,[...]]
> 2[41.431,-43.32]\t[[40.123,-50.432],...[,]]
> ...
>
> Basically it should join the file with ITSELF and, for each point, grab the
> N (N will be an argument to the job) CLOSEST points (the mappers should
> calculate the distances).
>
> The distributed cache is not an option, so what else? I'm not sure whether
> to classify this as a map-side join, a reduce-side join, or both.
> Would you do this in Hive somehow?
> Is it feasible in a single job?
>
> Would LOVE to hear your suggestions, code (if you think it's not that hard),
> or whatnot.
> BTW, using CDH3 - rev 3 (20.23)
>
> Thanks!
>
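Whichever way the self-join itself is arranged (with the distributed cache ruled out, one common pattern is a reduce-side blocked cross join: each point is replicated to a handful of partition pairs, each reducer does a nested loop over its two blocks, and a second pass merges the partial lists into the final N per point), every task ends up doing the same per-point bookkeeping: compute a distance and keep only the N closest candidates seen so far. Below is a minimal sketch of just that piece; the Point, distance() and nearestN() names are hypothetical, not from this thread or from any CDH API.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class NearestPoints {

    static class Point {
        final long id;
        final double lat, lng;
        Point(long id, double lat, double lng) { this.id = id; this.lat = lat; this.lng = lng; }
    }

    // Plain Euclidean distance on the raw coordinates; swap in haversine if
    // these are true geographic points and you need metres rather than degrees.
    static double distance(Point a, Point b) {
        double dLat = a.lat - b.lat;
        double dLng = a.lng - b.lng;
        return Math.sqrt(dLat * dLat + dLng * dLng);
    }

    // Keep only the n candidates closest to 'center', using a bounded max-heap
    // so memory stays O(n) however many candidates stream past.
    static List<Point> nearestN(final Point center, Iterable<Point> candidates, int n) {
        PriorityQueue<Point> heap = new PriorityQueue<Point>(n, new Comparator<Point>() {
            public int compare(Point x, Point y) {
                // farthest candidate sits on top, so it is the one evicted
                return Double.compare(distance(center, y), distance(center, x));
            }
        });
        for (Point p : candidates) {
            if (p.id == center.id) continue;                  // skip the point itself
            if (heap.size() < n) {
                heap.add(p);
            } else if (distance(center, p) < distance(center, heap.peek())) {
                heap.poll();
                heap.add(p);
            }
        }
        // heap order is arbitrary when copied out; sort nearest-first for output
        List<Point> result = new ArrayList<Point>(heap);
        Collections.sort(result, new Comparator<Point>() {
            public int compare(Point x, Point y) {
                return Double.compare(distance(center, x), distance(center, y));
            }
        });
        return result;
    }
}

The bounded heap keeps memory at O(N) per point, which matters once each point has to be compared against millions of candidates; whether this runs in the mapper or the reducer depends on which join arrangement you pick.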
