Right, but if I understood your suggestion, you look at the end goal,
which is:
1[40.123,50.432]\t[[41.431,43.32],[...,...],...,[...]]
for example, and you say: here we basically see a cluster, and that cluster
is represented by the point [40.123,50.432].
Which points does this cluster contain? [[41.431,43.32],[...,...],...,[...]]
Meaning that for every point I have in the dataset, you create a cluster.
If you don't mean that, but rather mean creating clusters based on some
random seed points or the like, that would mean
that I'll have points (talking about the "end goal") that won't have enough
points in their list.
One of the criteria for a clustering is that for any two clusters C_i and
C_j (where i != j), C_i intersect C_j is empty.
And again, how can I accomplish my task without running Mahout / a knn algorithm,
just by calculating the distance between points?
That is, a join of the file with itself.
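
To make the question concrete, here is a minimal sketch of the brute-force self-join I have in mind. It is plain Python, not Mahout or Hadoop code. The function names `nearest_neighbours` and `format_record` are made up for illustration, and the list size of 10 is the figure from my earlier message. It compares every point with every other point, so it is O(n^2) in the number of points:

```python
import heapq
import json
import math


def nearest_neighbours(points, n=10):
    """Pair each point with its n closest other points (brute force).

    Every point is compared with every other point, which is the
    "join of the file with itself", so the cost is O(N^2) distance
    computations for N points.
    """
    result = []
    for i, p in enumerate(points):
        # All points except p itself (compared by index, so duplicates survive).
        others = (q for j, q in enumerate(points) if j != i)
        closest = heapq.nsmallest(n, others, key=lambda q: math.dist(p, q))
        result.append((p, closest))
    return result


def format_record(point, neighbours):
    """Render one output line: the point, a tab, then its neighbour list."""
    compact = (",", ":")  # no spaces, to match the target format
    return (json.dumps(point, separators=compact)
            + "\t"
            + json.dumps(neighbours, separators=compact))


if __name__ == "__main__":
    sample = [(0, 0), (1, 0), (5, 5), (0, 2)]
    for point, neighbours in nearest_neighbours(sample, n=2):
        print(format_record(point, neighbours))
```

Note that the lists produced this way overlap (a point can appear in many other points' lists), so they are neighbour lists rather than clusters in the disjoint sense above.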
Thanks
On Tue, Aug 28, 2012 at 6:32 PM, Ted Dunning <tdunning@maprtech.com> wrote:
>
>
> On Tue, Aug 28, 2012 at 9:48 AM, dexter morgan <dextermorgan4u@gmail.com> wrote:
>
>>
>> I understand your solution (I think); I hadn't thought of it in that
>> particular way.
>> I think that if, let's say, I have 1M data points, running knn with
>> k=1M and n=10 (each point is a cluster that requires up to 10 points)
>> is overkill.
>>
>
> I am not sure I understand you. n = number of points. k = number of
> clusters. For searching 1 million points, I would recommend thousands of
> clusters.
>
>
>> How can I achieve the same result WITHOUT using Mahout, just running on
>> the dataset? I even think it'll have the same complexity (O(n^2)).
>>
>
> Running with a good knn package will give you roughly O(n log n)
> complexity.
>
>
