hadoop-user mailing list archives

From dexter morgan <dextermorga...@gmail.com>
Subject Re: best way to join?
Date Thu, 30 Aug 2012 09:21:21 GMT
Ok, but as I said before, how do I achieve the same result without
clustering, just with a linear scan? Basically a join of the data-set with itself,

calculating the distance as I go.
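
For illustration, the linear self-join asked about here could be sketched as below. This is only a single-machine sketch, not a MapReduce job. It assumes points are coordinate tuples and that plain Euclidean distance is acceptable (for latitude/longitude data a great-circle distance may be more appropriate). The names `nearest_neighbors_bruteforce` and `format_line` are hypothetical, and the output format only imitates the `point\t[neighbors]` layout quoted later in the thread.

```python
import heapq
import math


def nearest_neighbors_bruteforce(points, n_neighbors=10):
    """Self-join: compare every point with every other point.

    Performs O(n^2) distance computations and returns a list of
    (point, [closest other points, nearest first]) pairs.
    """
    result = []
    for i, p in enumerate(points):
        # Distance from p to every other point, computed as we go.
        candidates = (
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        # Keep only the n_neighbors smallest distances.
        closest = heapq.nsmallest(n_neighbors, candidates)
        result.append((p, [points[j] for _, j in closest]))
    return result


def format_line(point, neighbors):
    """Render one output line as: [x,y]<TAB>[[x1,y1],[x2,y2],...]"""
    def fmt(pt):
        return "[" + ",".join(str(c) for c in pt) + "]"
    return fmt(point) + "\t[" + ",".join(fmt(q) for q in neighbors) + "]"


if __name__ == "__main__":
    sample = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (0.0, 2.0)]
    for point, neighbors in nearest_neighbors_bruteforce(sample, n_neighbors=2):
        print(format_line(point, neighbors))
```

Every point is compared with every other point, so the cost grows quadratically with the size of the dataset.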

On Tue, Aug 28, 2012 at 11:07 PM, Ted Dunning <tdunning@maprtech.com> wrote:

> I don't mean that.
>
> I mean that a k-means clustering with pretty large clusters is a useful
> auxiliary data structure for finding nearest neighbors.  The basic outline
> is that you find the nearest clusters and search those for near neighbors.
>  The first riff is that you use a clever data structure for finding the
> nearest clusters so that you can do that faster than linear search.  The
> second riff is when you use another clever data structure to search each
> cluster quickly.
>
> There are fancier data structures available as well.
>
>
> On Tue, Aug 28, 2012 at 12:04 PM, dexter morgan <dextermorgan4u@gmail.com> wrote:
>
>> Right, but if I understood your suggestion, you look at the end goal,
>> which is:
>> 1[40.123,-50.432]\t[[41.431,-43.32],[...,...],...,[...]]
>>
>> for example, and you say: here we basically see a cluster, and that cluster
>> is represented by the point [40.123,-50.432].
>> Which points does this cluster contain? [[41.431,-43.32],[...,...],...,[...]]
>> Meaning that for every point I have in the dataset, you create a cluster.
>> If you don't mean that, but you do mean to create clusters based on some
>> random seed points or whatnot, that would mean
>> that I'll have points (talking about the "end goal") that won't have
>> enough points in their list.
>>
>> One of the criteria for a clustering is that for any two clusters C_i and
>> C_j (where i != j), C_i intersect C_j is empty.
>>
>> And again, how can I accomplish my task without running Mahout / a knn
>> algo? Just by calculating the distance between points,
>> a join of a file with itself.
>>
>> Thanks
>>
>> On Tue, Aug 28, 2012 at 6:32 PM, Ted Dunning <tdunning@maprtech.com> wrote:
>>
>>>
>>>
>>> On Tue, Aug 28, 2012 at 9:48 AM, dexter morgan <dextermorgan4u@gmail.com> wrote:
>>>
>>>>
>>>> I understand your solution (I think); I hadn't thought of it in that
>>>> particular way.
>>>> But say I have 1M data-points: I think that running knn with
>>>> k=1M and n=10 (each point is a cluster that requires up to 10 points)
>>>> is overkill.
>>>>
>>>
>>> I am not sure I understand you.  n = number of points.  k = number of
>>> clusters.  For searching 1 million points, I would recommend thousands of
>>> clusters.
>>>
>>>
>>>> How can I achieve the same result WITHOUT using Mahout, just running on
>>>> the dataset? I even think it'll have the same complexity (O(n^2)).
>>>>
>>>
>>> Running with a good knn package will give you roughly O(n log n)
>>> complexity.
>>>
>>>
>>
>
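
The outline quoted above (find the nearest clusters, then search only those for near neighbors) might look roughly like the sketch below. It is an illustration under several assumptions, not Mahout's implementation or any particular knn package: it uses a plain Lloyd's k-means written from scratch, Euclidean distance, and hypothetical names (`assign`, `build_clusters`, `approx_neighbors`, `probes`). It also still scans the cluster centers linearly, so it leaves out both of the "clever data structure" refinements mentioned in the thread.

```python
import heapq
import math
import random


def assign(points, centers):
    """Group point indices by their nearest center."""
    members = [[] for _ in centers]
    for idx, p in enumerate(points):
        nearest = min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
        members[nearest].append(idx)
    return members


def build_clusters(points, k, iterations=5, seed=0):
    """Plain Lloyd's k-means; returns (centers, members)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k distinct points as initial centers
    dims = len(points[0])
    for _ in range(iterations):
        members = assign(points, centers)
        for c, idxs in enumerate(members):
            if idxs:  # an empty cluster keeps its previous center
                centers[c] = tuple(
                    sum(points[i][d] for i in idxs) / len(idxs) for d in range(dims)
                )
    # Final assignment so that members matches the returned centers.
    return centers, assign(points, centers)


def approx_neighbors(query_idx, points, centers, members, n_neighbors=10, probes=3):
    """Search only the `probes` clusters whose centers are closest to the query."""
    q = points[query_idx]
    nearest_clusters = heapq.nsmallest(
        probes, range(len(centers)), key=lambda c: math.dist(q, centers[c])
    )
    candidates = (
        (math.dist(q, points[j]), j)
        for c in nearest_clusters
        for j in members[c]
        if j != query_idx
    )
    return [points[j] for _, j in heapq.nsmallest(n_neighbors, candidates)]
```

Each point belongs to exactly one cluster, but a query may probe several clusters, so a point near a cluster boundary can still collect a full neighbor list. With a small `probes` value the result is approximate; probing every cluster degenerates to the exact linear scan.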
