From: Björn-Elmar Macek
Date: Tue, 04 Sep 2012 10:17:31 +0200
To: user@hadoop.apache.org
Subject: Re: best way to join?

Hi Dexter,

I think what you want is a clustering of the points based on Euclidean distance, or density-based clustering (http://en.wikipedia.org/wiki/Cluster_analysis). I bet some of these are already implemented quite well in Mahout: AFAIK that is the data-mining framework built on Hadoop.

Best of luck!
Elmar

On 27.08.2012 22:15, dexter morgan wrote:
> Dear list,
>
> Let's say I have a file like this:
> id \t lat,lng <-- structure
>
> 1\t40.123,-50.432
> 2\t41.431,-43.32
> ...
> ...
> Let's call it 'points.txt'.
> I'm trying to build a map-reduce job that runs over this BIG points
> file, and it should output a file that will look like:
> id[lat,lng] \t [list of points in JSON standard] <--- structure
>
> 1[40.123,-50.432]\t[[41.431,-43.32],[...,...],...,[...]]
> 2[41.431,-43.32]\t[[40.123,-50.432],...[,]]
> ...
>
> Basically it should run on ITSELF, and grab for each point the N (it
> will be an argument for the job) CLOSEST points (the mappers should
> calculate the distance)..
>
> Distributed cache is not an option; what else? I'm not sure whether to
> classify it as a map-side join, a reduce-side join, or both.
> Would you do this in Hive somehow?
> Is it feasible in a single job?
>
> Would LOVE to hear your suggestions, code (if you think it's not that
> hard), or whatnot.
> BTW using CDH3 - rev 3 (20.23)
>
> Thanks!
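Not from the original thread — a minimal local Python sketch of the naive approach the question describes, so the map/reduce split is concrete: the "map" side joins the file against itself and emits a distance for every (point, candidate) pair, and the "reduce" side keeps only the N closest candidates per point id. The sample lines come from the question; everything else (function names, N=1) is illustrative, and a real Hadoop job would of course need a smarter partitioning than this O(n²) cross join.

```python
import heapq
import math

def parse(line):
    # "id\tlat,lng" per the structure described in the question
    pid, coords = line.split("\t")
    lat, lng = map(float, coords.split(","))
    return pid, (lat, lng)

def euclidean(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def map_phase(points):
    # self-join: emit (id, (distance, candidate)) for every other point,
    # as the mappers would when the file is run against itself
    for pid, p in points.items():
        for qid, q in points.items():
            if pid != qid:
                yield pid, (euclidean(p, q), q)

def reduce_phase(pairs, n):
    # group by point id, then keep only the N closest candidates
    grouped = {}
    for pid, (dist, q) in pairs:
        grouped.setdefault(pid, []).append((dist, q))
    return {pid: [q for _, q in heapq.nsmallest(n, cands)]
            for pid, cands in grouped.items()}

lines = ["1\t40.123,-50.432", "2\t41.431,-43.32", "3\t40.5,-50.0"]
points = dict(parse(l) for l in lines)
nearest = reduce_phase(map_phase(points), n=1)
print(nearest["1"])  # the single closest point to point 1
```

The same grouping falls out naturally in Hadoop if the mapper emits the point id as the key: the shuffle delivers all candidates for one point to one reducer, which only has to keep a size-N heap instead of materializing the whole list.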