hadoop-user mailing list archives

From dexter morgan <dextermorga...@gmail.com>
Subject Re: best way to join?
Date Tue, 28 Aug 2012 16:04:25 GMT
Right, but if I understood your suggestion, you look at the end goal,
which is:

for example, you say: here we see a cluster, and that cluster is
represented by the point [40.123,-50.432].
Which points does this cluster contain? [[41.431,-
meaning: for every point I have in the dataset, you create a cluster.
If you don't mean that, but rather mean to create clusters based on some
random seed points or the like, then I'll have points (talking about the
"end goal") that won't have enough points in their list.

One of the criteria for a clustering is that for any clusters C_i and
C_j (where i != j), C_i intersect C_j is empty.

And again, how can I accomplish my task without running Mahout / a kNN
algorithm? Just by calculating distances between points, i.e., a join of
a file with itself?
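The brute-force "join of a file with itself" idea can be sketched roughly like this (an illustrative single-machine Python sketch, not a Hadoop job; the sample coordinates and the use of plain Euclidean distance are my assumptions, and for real lat/lon data a great-circle distance would be more appropriate):

```python
import math

def brute_force_knn(points, k):
    """For every point, find its k nearest other points by comparing
    all pairs -- the self-join approach, O(n^2) distance computations."""
    result = {}
    for i, p in enumerate(points):
        dists = []
        for j, q in enumerate(points):
            if i == j:
                continue
            # Euclidean distance; a placeholder for whatever metric fits the data
            dists.append((math.dist(p, q), j))
        dists.sort()
        result[i] = [j for _, j in dists[:k]]
    return result
```

For 1M points this is about 10^12 distance computations, which is exactly why the thread is debating cheaper alternatives.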


On Tue, Aug 28, 2012 at 6:32 PM, Ted Dunning <tdunning@maprtech.com> wrote:

> On Tue, Aug 28, 2012 at 9:48 AM, dexter morgan <dextermorgan4u@gmail.com> wrote:
>> I understand your solution (I think); I just hadn't thought of it in
>> that particular way.
>> But say I have 1M data points: running kNN with k=1M and n=10 (each
>> point is a cluster that requires up to 10 points) seems like overkill.
> I am not sure I understand you.  n = number of points.  k = number of
> clusters.  For searching 1 million points, I would recommend thousands of
> clusters.
>> How can I achieve the same result WITHOUT using Mahout, just by running
>> over the dataset? I'd even think it would have the same complexity (O(n^2)).
> Running with a good knn package will give you roughly O(n log n)
> complexity.
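The pruning that gets a good kNN package down to roughly O(n log n) can be sketched as: cluster the points first, then answer each query by scanning only the query point's own cluster. This is an illustrative Python sketch of the idea, not Mahout's actual API; `cluster_then_knn` and the fixed centroids are my own placeholders, and a real implementation would also probe a few neighboring clusters to avoid missing neighbors near cluster boundaries:

```python
import math

def cluster_then_knn(points, centroids, k):
    """Assign each point to its nearest centroid, then answer kNN queries
    by comparing only against points in the same cluster (the pruning idea)."""
    buckets = {i: [] for i in range(len(centroids))}
    for idx, p in enumerate(points):
        c = min(range(len(centroids)), key=lambda ci: math.dist(p, centroids[ci]))
        buckets[c].append(idx)

    def knn(query_idx):
        p = points[query_idx]
        c = min(range(len(centroids)), key=lambda ci: math.dist(p, centroids[ci]))
        # Only the query's cluster is searched, instead of all n points
        cand = [j for j in buckets[c] if j != query_idx]
        cand.sort(key=lambda j: math.dist(p, points[j]))
        return cand[:k]

    return knn
```

With thousands of clusters over 1M points, each query touches only a few hundred candidates instead of the full million, which is where the speedup over the self-join comes from.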
