hadoop-user mailing list archives

From Ted Dunning <tdunn...@maprtech.com>
Subject Re: best way to join?
Date Tue, 28 Aug 2012 20:07:40 GMT
I don't mean that.

I mean that a k-means clustering with pretty large clusters is a useful
auxiliary data structure for finding nearest neighbors.  The basic outline
is that you find the nearest clusters and search those for near neighbors.
The first riff is that you use a clever data structure for finding the
nearest clusters so that you can do that faster than linear search.  The
second riff is when you use another clever data structure to search each
cluster quickly.
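
A minimal sketch of the basic outline above, in Python, may make it concrete. It is not Mahout code: the function names (assign, kmeans, nearest_neighbors) and the parameters (num_clusters, probes) are hypothetical, and plain Euclidean distance on the raw coordinates is an assumption. Both searches here are linear scans; the two riffs would replace the centroid scan and the per-cluster scan with faster structures.

```python
import heapq
import math
import random


def assign(points, centroids):
    """Put each point into the member list of its nearest centroid."""
    members = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda c: math.dist(p, centroids[c]))
        members[nearest].append(p)
    return members


def kmeans(points, num_clusters, iterations=10, seed=0):
    """Plain Lloyd's k-means (an assumption; any clustering would do)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, num_clusters)
    for _ in range(iterations):
        members = assign(points, centroids)
        # Move each centroid to the mean of its members; leave empty ones alone.
        centroids = [
            tuple(sum(axis) / len(m) for axis in zip(*m)) if m else c
            for c, m in zip(centroids, members)
        ]
    # Final assignment so the member lists match the returned centroids.
    return centroids, assign(points, centroids)


def nearest_neighbors(query, centroids, members, k=10, probes=3):
    """Approximate k nearest neighbors: search only the `probes` nearest clusters."""
    # Step 1: find the nearest clusters (linear scan; the "first riff" speeds this up).
    closest = heapq.nsmallest(probes, range(len(centroids)),
                              key=lambda c: math.dist(query, centroids[c]))
    # Step 2: search those clusters (linear scan; the "second riff" speeds this up).
    candidates = [p for c in closest for p in members[c] if p != query]
    return heapq.nsmallest(k, candidates, key=lambda p: math.dist(query, p))


if __name__ == "__main__":
    rng = random.Random(42)
    # Made-up sample data standing in for the real point set.
    points = [(rng.uniform(-90, 90), rng.uniform(-180, 180)) for _ in range(1000)]
    centroids, members = kmeans(points, num_clusters=20)
    print(nearest_neighbors(points[0], centroids, members, k=10, probes=3))
```

Probing more than one cluster matters because a point near a cluster boundary can have neighbors in the adjacent cluster. With probes equal to the number of clusters the result is exact, and smaller values trade accuracy for speed.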

There are fancier data structures available as well.

On Tue, Aug 28, 2012 at 12:04 PM, dexter morgan <dextermorgan4u@gmail.com> wrote:

> Right, but if I understood your suggestion, you look at the end goal,
> which is:
> 1[40.123,-50.432]\t[[41.431,-43.32],[...,...],...,[...]]
> for example, and you say: here we basically see a cluster, and that cluster
> is represented by the point: [40.123,-50.432]
> Which points does this cluster contain?
> [[41.431,-43.32],[...,...],...,[...]]
> Meaning that for every point I have in the dataset, you create a cluster.
> If you don't mean that, but you do mean to create clusters based on some
> random seed points or whatnot, that would mean
> that I'll have points (talking about the "end goal") that won't have
> enough points in their list.
> One of the criteria for a clustering is that for any clusters C_i and
> C_j (where i != j), C_i intersect C_j is empty.
> And again, how can I accomplish my task without running Mahout / a knn
> algorithm? Just by calculating the distance between points?
> A join of a file with itself.
> Thanks
> On Tue, Aug 28, 2012 at 6:32 PM, Ted Dunning <tdunning@maprtech.com> wrote:
>> On Tue, Aug 28, 2012 at 9:48 AM, dexter morgan <dextermorgan4u@gmail.com> wrote:
>>> I understand your solution (I think); I hadn't thought of it in that
>>> particular way.
>>> I think that if, let's say, I have 1M data points, running knn with
>>> k=1M and n=10 (each point is a cluster that requires up to 10 points)
>>> is overkill.
>> I am not sure I understand you.  n = number of points.  k = number of
>> clusters.  For searching 1 million points, I would recommend thousands of
>> clusters.
>>> How can I achieve the same result WITHOUT using Mahout, just running on
>>> the dataset? I even think it'll have the same complexity (O(n^2)).
>> Running with a good knn package will give you roughly O(n log n)
>> complexity.
