hadoop-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From dexter morgan <dextermorga...@gmail.com>
Subject Re: best way to join?
Date Fri, 31 Aug 2012 13:03:00 GMT
Hi Ted,

First of all, i'd like to know how to make a map/reduce job that does join
on the input-file it self.
Second, maybe your clustering approach be usefull, i still think it's not
correct.
Reason:
Lets say i want to find the 10 closest points for a given point. Point:
[120,90] for example.
Clustering approach: which cluster has [120,90] as a node? answer: the
cluster at [300,200]
Now, if i understood you, i should get the 10 nearest neighbors of
[300,200] (again, you didn't elaborate much on this or i didn't understand
it)

But i require the 10 nearest to [120,90] , not to [300,200]. Even if i know
the distances from [120,90] to [300,200] and to the 10 nearest points to
[300,200] it won't help me, because maybe the 10 nearest points to [120,90]
are actually starting from the 5000th nearest points to [300,200].

In the end my goal is to pre-process (as i wrote at the begining) this list
of N nearest points for every point in the file. Where N is a parameter
given to the job. Let's say 10 points. That's it.
No calculation after-wards, only querying that list.

Thank you

On Thu, Aug 30, 2012 at 11:05 PM, Ted Dunning <tdunning@maprtech.com> wrote:

> I don't know off-hand.  I don't understand the importance of your
> constraint either.
>
>
> On Thu, Aug 30, 2012 at 5:21 AM, dexter morgan <dextermorgan4u@gmail.com>wrote:
>
>> Ok, but as i said before, how do i achieve the same result with out
>> clustering , just linear. Join on the same data-set basically?
>>
>> and calculating the distance as i go
>>
>> On Tue, Aug 28, 2012 at 11:07 PM, Ted Dunning <tdunning@maprtech.com>wrote:
>>
>>> I don't mean that.
>>>
>>> I mean that a k-means clustering with pretty large clusters is a useful
>>> auxiliary data structure for finding nearest neighbors.  The basic outline
>>> is that you find the nearest clusters and search those for near neighbors.
>>>  The first riff is that you use a clever data structure for finding the
>>> nearest clusters so that you can do that faster than linear search.  The
>>> second riff is when you use another clever data structure to search each
>>> cluster quickly.
>>>
>>> There are fancier data structures available as well.
>>>
>>>
>>> On Tue, Aug 28, 2012 at 12:04 PM, dexter morgan <
>>> dextermorgan4u@gmail.com> wrote:
>>>
>>>> Right, but if i understood your sugesstion, you look at the end goal ,
>>>> which is:
>>>> 1[40.123,-50.432]\t[[41.431,-43.32],[...,...],...,[...]]
>>>>
>>>> for example, and you say: here we see a cluster basically, that cluster
>>>> is represented by the point:  [40.123,-50.432]
>>>> which points does this cluster contains?  [[41.431,-
>>>> 43.32],[...,...],...,[...]]
>>>> meaning: that for every point i have in the dataset, you create a
>>>> cluster.
>>>> If you don't mean that, but you do mean to create clusters based on
>>>> some random-seed points or what not, that would mean
>>>>  that i'll have points (talking about the "end goal") that won't have
>>>> enough points in their list.
>>>>
>>>> one of the criterions for a clustering is that for any clusters: C_i
>>>> and C_j (where i != j), C_i intersect C_j is empty
>>>>
>>>> and again, how can i accomplish my task with out running mahout / knn
>>>> algo? just by calculating distance between points?
>>>> join of a file with it self.
>>>>
>>>> Thanks
>>>>
>>>> On Tue, Aug 28, 2012 at 6:32 PM, Ted Dunning <tdunning@maprtech.com>wrote:
>>>>
>>>>>
>>>>>
>>>>> On Tue, Aug 28, 2012 at 9:48 AM, dexter morgan <
>>>>> dextermorgan4u@gmail.com> wrote:
>>>>>
>>>>>>
>>>>>> I understand your solution ( i think) , didn't think of that, in
that
>>>>>> particular way.
>>>>>> I think that lets say i have 1M data-points, and running knn , that
>>>>>> the k=1M and n=10 (each point is a cluster that requires up to 10
points)
>>>>>> is an overkill.
>>>>>>
>>>>>
>>>>> I am not sure I understand you.  n = number of points.  k = number of
>>>>> clusters.  For searching 1 million points, I would recommend thousands
of
>>>>> clusters.
>>>>>
>>>>>
>>>>>> How can i achieve the same result WITHOUT using mahout, just running
>>>>>> on the dataset , i even think it'll be in the same complexity (o(n^2))
>>>>>>
>>>>>
>>>>> Running with a good knn package will give you roughly O(n log n)
>>>>> complexity.
>>>>>
>>>>>
>>>>
>>>
>>
>

Mime
View raw message