hadoop-hdfs-user mailing list archives

From dexter morgan <dextermorga...@gmail.com>
Subject Re: best way to join?
Date Tue, 28 Aug 2012 13:48:05 GMT
Dear Ted,

I understand your solution (I think); I hadn't considered it in that
particular way.
But say I have 1M data points: running k-NN where k = 1M and n = 10
(each point is a cluster that requires up to 10 points)
seems like overkill.

How can I achieve the same result WITHOUT using Mahout, just running
directly over the dataset? I even think it would have the same complexity
(O(n^2)), calculating the distance between each pair of points.

And maybe the reducer would just sort them in descending order for each point.
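The brute-force approach above can be sketched locally, outside Hadoop. This is a minimal illustration, not the actual job: for every point, compute the distance to every other point and keep the n closest. The point ids and coordinates are made-up sample data; on a real cluster the same per-point loop would run inside the map/reduce tasks.

```python
import heapq
import math

def euclidean(p, q):
    """Plain Euclidean distance; real lat/lng data would want haversine."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def n_closest(points, n):
    """points: dict id -> (lat, lng). Returns dict id -> [(dist, id), ...],
    the n nearest neighbours of each point, sorted nearest first.
    This is the O(n^2) pairwise pass discussed above."""
    out = {}
    for pid, p in points.items():
        others = ((euclidean(p, q), qid)
                  for qid, q in points.items() if qid != pid)
        out[pid] = heapq.nsmallest(n, others)
    return out

# Sample data shaped like the points.txt rows from the original question.
points = {
    1: (40.123, -50.432),
    2: (41.431, -43.32),
    3: (40.5, -50.0),
}
print(n_closest(points, 2))
```

At 1M points the pairwise pass is a trillion distance computations, which is exactly why the clustering-based pruning below is attractive.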

Thank you!

On Tue, Aug 28, 2012 at 12:52 AM, Ted Dunning <tdunning@maprtech.com> wrote:

> Mahout is getting some very fast knn code in version 0.8.
> The basic work flow is that you would first do a large-scale clustering of
> the data.  Then you would make a second pass using the clustering to
> facilitate fast search for nearby points.
> The clustering will require two map-reduce jobs, one to find the cluster
> definitions and the second map-only to assign points to clusters in a form
> to be used by the second pass.  The second pass is a map-only process as
> well.
> In order to make this as efficient as possible, it is desirable to use a
> distribution of Hadoop that allows you to directly map the cluster data
> structures into shared memory.  IF you have NFS access to your cluster,
> this is easy.  Otherwise, it is considerably trickier.
> On Mon, Aug 27, 2012 at 4:15 PM, dexter morgan <dextermorgan4u@gmail.com> wrote:
>> Dear list,
>> Lets say i have a file, like this:
>> id \t lat,lng <-- structure
>> 1\t40.123,-50.432
>> 2\t41.431,-43.32
>> ...
>> ...
>> lets call it: 'points.txt'
>> I'm trying to build a map-reduce job that runs over this BIG points file
>> and it should output
>> a file, that will look like:
>> id[lat,lng] \t [list of points in JSON standard] <--- structure
>> 1[40.123,-50.432]\t[[41.431,-43.32],[...,...],...,[...]]
>> 2[41.431,-43.32]\t[[40.123,-50.432],...[,]]
>> ...
>> Basically it should run on ITSELF, and grab for each point the N (it will
>> be an argument for the job) CLOSEST points (the mappers should calculate
>> the distance).
>> Distributed cache is not an option; what else? Not sure whether to classify
>> it as a map-join, a reduce-join, or both?
>> Would you do this in HIVE some how?
>> Is it feasible in a single job?
>> Would LOVE to hear your suggestions, code (if you think its not that
>> hard) or what not.
>> BTW using CDH3 - rev 3 (20.23)
>> Thanks!
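The two-pass workflow Ted describes (cluster first, then search only near clusters) can be illustrated with a rough, hypothetical sketch. Here a fixed lat/lng grid stands in for the real clustering pass, so this shows only the pruning idea, not Mahout's actual implementation; `CELL` and all names are made up for the example.

```python
import heapq
import math
from collections import defaultdict

CELL = 1.0  # grid cell size in degrees; an arbitrary tuning knob

def cell_of(p):
    return (math.floor(p[0] / CELL), math.floor(p[1] / CELL))

def build_index(points):
    """First pass: assign every point to a coarse 'cluster' (a grid cell).
    On Hadoop this would be the map-only assignment job Ted mentions."""
    index = defaultdict(list)
    for pid, p in points.items():
        index[cell_of(p)].append((pid, p))
    return index

def nearby(index, pid, p, n):
    """Second pass: search only the 3x3 block of cells around p instead of
    the whole dataset. Note: pure grid pruning can miss true neighbours
    that sit just beyond the searched cells."""
    cx, cy = cell_of(p)
    cands = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for qid, q in index.get((cx + dx, cy + dy), []):
                if qid != pid:
                    cands.append((math.hypot(p[0] - q[0], p[1] - q[1]), qid))
    return heapq.nsmallest(n, cands)

points = {1: (40.1, -50.4), 2: (41.4, -43.3), 3: (40.5, -50.0)}
idx = build_index(points)
print(nearby(idx, 1, points[1], 2))
```

The payoff is that each query touches only the points in a handful of cells, so the quadratic cost shrinks toward O(n × points-per-neighbourhood); the shared-memory mapping Ted mentions is about making the index cheap for every map task to read.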
