hadoop-user mailing list archives

From dexter morgan <dextermorga...@gmail.com>
Subject Re: best way to join?
Date Sun, 09 Sep 2012 09:22:08 GMT
Elmar,

Right, thanks a lot for your help. If you read what Ted suggested, it's
basically this. I'm also interested in knowing how to do it using a JOIN
(map-side join + reduce-side join, I suppose), though I'll go with the
Mahout approach.
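
Since the map-join / reduce-join route came up: here's a rough sketch of one
reduce-side approach (the names and the grid-cell trick are my own
illustration, not something from Ted's or Elmar's mails). The idea is to
avoid a full cross join by bucketing points into grid cells: each mapper
emits a point to its own cell plus the 8 surrounding cells, so every reducer
sees all candidate pairs for the points it "owns". Shown as plain Python
functions in the Hadoop Streaming spirit; a real job would emit
tab-separated lines instead of tuples, and the cell size must be at least as
large as the biggest neighbour distance you care about, or true nearest
neighbours sitting in non-adjacent cells will be missed.

```python
# Sketch only: a reduce-side self-join for N nearest neighbours using
# grid-cell bucketing. CELL and all function names are hypothetical.
import math
from collections import defaultdict

CELL = 1.0  # grid cell size in degrees -- tune to the data density


def cell_of(lat, lng):
    """Map a coordinate to its (row, col) grid cell."""
    return (int(math.floor(lat / CELL)), int(math.floor(lng / CELL)))


def mapper(line):
    """Input line: 'id\\tlat,lng'. Emit the point keyed by its home cell
    and the 8 neighbouring cells; 'owned' marks the home-cell copy."""
    pid, coords = line.split("\t")
    lat, lng = map(float, coords.split(","))
    home = cell_of(lat, lng)
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            cell = (home[0] + di, home[1] + dj)
            yield cell, (pid, lat, lng, cell == home)


def reducer(cell, values, n):
    """For each point owned by this cell, keep its n nearest candidates
    among everything the cell received (owned points + spill-overs)."""
    values = list(values)
    for pid, lat, lng, owned in values:
        if not owned:
            continue  # this copy is only a candidate for someone else
        nearest = sorted(
            (math.hypot(lat - la, lng - lo), qid, la, lo)
            for qid, la, lo, _ in values if qid != pid
        )[:n]
        yield pid, [lat, lng], [[la, lo] for _, _, la, lo in nearest]


def run(lines, n):
    """Simulate the shuffle locally: group mapper output by cell,
    run the reducer per cell, collect one result per point id."""
    shuffle = defaultdict(list)
    for line in lines:
        for key, val in mapper(line):
            shuffle[key].append(val)
    out = {}
    for cell, vals in shuffle.items():
        for pid, point, neighbours in reducer(cell, vals, n):
            out[pid] = (point, neighbours)
    return out
```

It is a single job, and no distributed cache is needed; the price is the 9x
map output and the adjacent-cells-only approximation noted above.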

Best,
Dex

On Tue, Sep 4, 2012 at 11:17 AM, Björn-Elmar Macek
<macek@cs.uni-kassel.de>wrote:

> Hi Dexter,
>
> i think what you want is a clustering of points based on the Euclidean
> distance, or density-based clustering (
> http://en.wikipedia.org/wiki/Cluster_analysis ). I bet there are some
> implemented quite well in Mahout already: AFAIK it is the data-mining
> framework built on Hadoop.
>
> Best luck!
> Elmar
>
>
> On 27.08.2012 22:15, dexter morgan wrote:
>
>> Dear list,
>>
>> Lets say i have a file, like this:
>> id \t lat,lng <-- structure
>>
>> 1\t40.123,-50.432
>> 2\t41.431,-43.32
>> ...
>> ...
>> lets call it: 'points.txt'
>> I'm trying to build a map-reduce job that runs over this BIG points file
>> and it should output
>> a file, that will look like:
>> id[lat,lng] \t [list of points in JSON standard] <--- structure
>>
>> 1[40.123,-50.432]\t[[41.431,-43.32],[...,...],...,[...]]
>> 2[41.431,-43.32]\t[[40.123,-50.432],...[,]]
>> ...
>>
>> Basically it should run on ITSELF and grab, for each point, its N (an
>> argument to the job) CLOSEST points (the mappers should calculate the
>> distances).
>>
>> The distributed cache is not an option, so what else? I'm not sure
>> whether to classify it as a map-side join, a reduce-side join, or both.
>> Would you do this in HIVE somehow?
>> Is it feasible in a single job?
>>
>> Would LOVE to hear your suggestions, code (if you think it's not that
>> hard), or whatnot.
>> BTW using CDH3 - rev 3 (20.23)
>>
>> Thanks!
>>
>
>
