hadoop-common-user mailing list archives

From Mirko Kämpf <mirko.kae...@gmail.com>
Subject Re: best way to join?
Date Sun, 09 Sep 2012 09:55:33 GMT
Hi Dexter,

I am not sure I understood your requirements correctly,
so let me repeat them to define a starting point.

1.) You have a (static) list of points (the points.txt file)

2.) Now you want to calculate the nearest points to a set of given points.
Are the points which have to be considered in a different data set, or do
you look for the closest points "within" your big list (in the points.txt file)?

Let's assume the latter is what you want:

I suggest using Mahout for this. This part of its documentation could be helpful:

Vector Similarity

Mahout contains implementations that allow one to compare one or more
vectors with another set of vectors. This can be useful if one is, for
instance, trying to calculate the pairwise similarity between all documents
(or a subset of docs) in a corpus.

   - RowSimilarityJob – Builds an inverted index and then computes
   distances between items that have co-occurrences. This is a fully
   distributed calculation.
   - VectorDistanceJob – Does a map side join between a set of "seed"
   vectors and all of the input vectors.


If you look for pairs within just one data set, you use it both as the seed
vectors and as the input vectors; otherwise you use different files or portions of
your data.
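To make the seed/input idea concrete, here is a minimal sketch in plain Python (not Mahout's actual API; the helper name n_nearest and the toy coordinates are invented for illustration): each "mapper" would hold the seed points in memory and, for every input point, find the n nearest seeds by Euclidean distance.

```python
import heapq
import math

def n_nearest(point, seeds, n):
    """Return the n seed points closest to `point` by Euclidean distance,
    excluding the point itself if it appears among the seeds."""
    dists = [(math.dist(point, s), s) for s in seeds if s != point]
    return [s for _, s in heapq.nsmallest(n, dists)]

# Toy data in the points.txt shape: id -> (lat, lng)
points = {
    "1": (40.123, -50.432),
    "2": (41.431, -43.32),
    "3": (40.500, -50.000),
}

# In a real job the seed list would be loaded once per mapper (e.g. in a
# setup() method); using the same file as seeds and input is the self-join case.
seeds = list(points.values())
for pid, p in points.items():
    print(pid, p, n_nearest(p, seeds, 2))
```

The self-join case is exactly what VectorDistanceJob does when seeds and input are the same file.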

I hope that helps.

Best wishes


2012/9/9 dexter morgan <dextermorgan4u@gmail.com>

> Elmar,
> Right, thanks a lot for your help. If you read what Ted suggested, it's
> basically this. I'm interested in knowing how to do this using JOIN
> (map-join + reduce-join, I suppose) as well... though I'll go with the
> Mahout approach.
> Best,
> Dex
> On Tue, Sep 4, 2012 at 11:17 AM, Björn-Elmar Macek <macek@cs.uni-kassel.de
> > wrote:
>> Hi Dexter,
>> I think what you want is a clustering of points based on the Euclidean
>> distance, or density-based clustering (
>> http://en.wikipedia.org/wiki/Cluster_analysis ). I
>> bet there are some implemented quite well in Mahout already: afaik this is
>> the data-mining framework based on Hadoop.
>> Best luck!
>> Elmar
>> On 27.08.2012 22:15, dexter morgan wrote:
>>> Dear list,
>>> Let's say I have a file like this:
>>> id \t lat,lng <-- structure
>>> 1\t40.123,-50.432
>>> 2\t41.431,-43.32
>>> ...
>>> ...
>>> let's call it 'points.txt'.
>>> I'm trying to build a map-reduce job that runs over this BIG points file
>>> and outputs a file that will look like:
>>> id[lat,lng] \t [list of points in JSON standard] <--- structure
>>> 1[40.123,-50.432]\t[[41.431,-43.32],[...,...],...,[...]]
>>> 2[41.431,-43.32]\t[[40.123,-50.432],...[,]]
>>> ...
>>> Basically it should run on ITSELF, and grab for each point the N (it
>>> will be an argument for the job) CLOSEST points (the mappers should
>>> calculate the distance).
>>> Distributed cache is not an option; what else? Not sure whether to classify
>>> it as a map-join, a reduce-join, or both?
>>> Would you do this in HIVE somehow?
>>> Is it feasible in a single job?
>>> Would LOVE to hear your suggestions, code (if you think it's not that
>>> hard), or whatnot.
>>> BTW using CDH3 - rev 3 (20.23)
>>> Thanks!
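For what it's worth, a hedged sketch of a single streaming-style map step for the format above (plain Python; the helper names are invented for illustration, and in a real job the full point list would have to be shipped to every mapper, which is exactly the constraint the thread discusses):

```python
import json
import math

def parse_line(line):
    """Parse one 'id\\tlat,lng' line in the points.txt shape."""
    pid, coords = line.rstrip("\n").split("\t")
    lat, lng = (float(x) for x in coords.split(","))
    return pid, (lat, lng)

def emit_nearest(line, all_points, n):
    """Build one output line: id[lat,lng]\\t[JSON list of the n nearest points]."""
    pid, p = parse_line(line)
    # Sort every other point by Euclidean distance and keep the n closest.
    others = sorted(
        (q for q in all_points if q != p),
        key=lambda q: math.dist(p, q),
    )[:n]
    return "%s[%s,%s]\t%s" % (pid, p[0], p[1],
                              json.dumps([list(q) for q in others]))

lines = ["1\t40.123,-50.432", "2\t41.431,-43.32", "3\t40.5,-50.0"]
all_points = [parse_line(l)[1] for l in lines]
for l in lines:
    print(emit_nearest(l, all_points, 2))
```

This only illustrates the per-point computation and output shape; distributing all_points without the distributed cache is the part Mahout's VectorDistanceJob (or a reduce-side self-join) solves.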
