hadoop-user mailing list archives

From: Björn-Elmar Macek <ma...@cs.uni-kassel.de>
Subject: Re: best way to join?
Date: Tue, 04 Sep 2012 08:17:31 GMT
Hi Dexter,

I think what you want is a clustering of the points based on Euclidean
distance, or a density-based clustering
(http://en.wikipedia.org/wiki/Cluster_analysis). I bet some of these are
already implemented quite well in Mahout, which, afaik, is the
data-mining framework built on top of Hadoop.
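
Whatever join strategy you end up using, the inner loop is the same
everywhere: compute distances and keep only the N closest points seen so
far. Just as a sketch of that building block (plain Java; the Point class
and helper names are mine, and it uses plain Euclidean distance on raw
lat/lng, i.e. no haversine correction):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative value class for one "id \t lat,lng" record.
class Point {
    final long id;
    final double lat, lng;

    Point(long id, double lat, double lng) {
        this.id = id; this.lat = lat; this.lng = lng;
    }

    // Plain Euclidean distance; for real geo data you would want haversine.
    double distanceTo(Point other) {
        double dLat = lat - other.lat, dLng = lng - other.lng;
        return Math.sqrt(dLat * dLat + dLng * dLng);
    }
}

class NearestNeighbors {
    // Return the n candidates closest to 'center' in a single pass.
    static List<Point> nearest(final Point center, Iterable<Point> candidates, int n) {
        // Max-heap by distance: the head is the worst of the current best n,
        // so exceeding n means polling exactly the point we want to drop.
        PriorityQueue<Point> heap = new PriorityQueue<Point>(Math.max(1, n),
            new Comparator<Point>() {
                public int compare(Point a, Point b) {
                    return Double.compare(center.distanceTo(b), center.distanceTo(a));
                }
            });
        for (Point p : candidates) {
            if (p.id == center.id) continue;   // a point is not its own neighbour
            heap.offer(p);
            if (heap.size() > n) heap.poll();  // evict the farthest of the n+1
        }
        return new ArrayList<Point>(heap);     // heap order, not sorted by distance
    }
}

A bounded heap like this keeps the memory per point at O(N), which
matters once the candidate lists get long.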

Good luck!
Elmar


On 27.08.2012 at 22:15, dexter morgan wrote:
> Dear list,
>
> Let's say I have a file like this:
> id \t lat,lng <-- structure
>
> 1\t40.123,-50.432
> 2\t41.431,-43.32
> ...
> ...
> Let's call it 'points.txt'.
> I'm trying to build a map-reduce job that runs over this BIG points 
> file, and it should output
> a file that looks like:
> id[lat,lng] \t [list of points in JSON format] <-- structure
>
> 1[40.123,-50.432]\t[[41.431,-43.32],[...,...],...,[...]]
> 2[41.431,-43.32]\t[[40.123,-50.432],...[,]]
> ...
>
> Basically it should run on ITSELF, and grab for each point the N 
> CLOSEST points (N will be an argument for the job; the mappers should 
> calculate the distances).
>
> Distributed cache is not an option, so what else? I'm not sure whether 
> to classify it as a map-join, a reduce-join, or both.
> Would you do this in HIVE somehow?
> Is it feasible in a single job?
>
> Would LOVE to hear your suggestions, code (if you think it's not that 
> hard), or whatnot.
> BTW, I'm using CDH3 - rev 3 (20.23).
>
> Thanks!
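
P.S. Regarding the single-job question: it is feasible in the trivial
sense if one reducer can buffer the whole file: every mapper re-emits its
lines under a single constant key, and the lone reducer does the
quadratic scan. That obviously stops working when BIG really means big,
but it is a baseline to test against. A rough sketch against the old
org.apache.hadoop.mapred API (the one shipped with CDH3); N is hard-coded
here, in practice you would read it from the JobConf:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sketch only: in a real project each public class goes in its own file.

// Mapper: funnel every "id \t lat,lng" line to one constant key, so a
// single reducer sees the complete file.
public class AllPairsMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
    private static final Text ALL = new Text("all");

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter rep) throws IOException {
        out.collect(ALL, line);
    }
}

// Reducer: buffer all points, then emit each point with its N nearest
// in the "id[lat,lng] \t [[lat,lng],...]" format from the question.
public class KnnReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
    private static final int N = 10; // the job argument from the question

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> out, Reporter rep) throws IOException {
        final List<String> ids = new ArrayList<String>();
        final List<double[]> pts = new ArrayList<double[]>();
        while (values.hasNext()) {
            String[] rec = values.next().toString().split("\t");
            String[] ll = rec[1].split(",");
            ids.add(rec[0]);
            pts.add(new double[] { Double.parseDouble(ll[0]), Double.parseDouble(ll[1]) });
        }
        for (int i = 0; i < pts.size(); i++) {
            final double[] c = pts.get(i);
            // Brute force: order every point by (squared) distance to point i.
            Integer[] order = new Integer[pts.size()];
            for (int j = 0; j < order.length; j++) order[j] = j;
            Arrays.sort(order, new Comparator<Integer>() {
                public int compare(Integer a, Integer b) {
                    return Double.compare(dist(c, pts.get(a)), dist(c, pts.get(b)));
                }
            });
            StringBuilder json = new StringBuilder("[");
            int taken = 0;
            for (int k = 0; k < order.length && taken < N; k++) {
                int j = order[k];
                if (j == i) continue; // skip the point itself
                if (taken++ > 0) json.append(',');
                json.append('[').append(pts.get(j)[0]).append(',')
                    .append(pts.get(j)[1]).append(']');
            }
            json.append(']');
            out.collect(new Text(ids.get(i) + "[" + c[0] + "," + c[1] + "]"),
                        new Text(json.toString()));
        }
    }

    private static double dist(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return dx * dx + dy * dy; // squared distance orders the same as Euclidean
    }
}

For real sizes you would rather tile the plane into grid cells, emit each
point to its own cell plus the neighbouring cells, and run the same scan
per cell. Plain HIVE does not help much here, since "N closest" is not an
equi-join condition.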

