mahout-dev mailing list archives

From: Federico Castanedo <castanedof...@gmail.com>
Subject: Re: some new clustering code
Date: Fri, 06 Apr 2012 19:48:47 GMT
Hello Ted,

The only difference I notice between the Kmeans and StreamingKmeans classes
is the dynamic increment of maxClusters and the distanceCutoff test.
So I ran the KMeans class against a subset of the BigCross dataset and it
worked fine.
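
Just to check that I am reading the code right, this is my mental model of
the streaming step, written as a self-contained sketch with plain double[]
points instead of Mahout vectors (the names, the growth factors and the
recluster step are my guesses, not the actual StreamingKmeans code):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Rough sketch of the streaming assignment step as I understand it.
public class StreamingSketch {
  private List<double[]> centroids = new ArrayList<double[]>();
  private List<Double> weights = new ArrayList<Double>();
  private final Random rand = new Random();
  private double distanceCutoff;
  private int maxClusters;

  public StreamingSketch(double initialCutoff, int initialMaxClusters) {
    distanceCutoff = initialCutoff;
    maxClusters = initialMaxClusters;
  }

  public void add(double[] point, double weight) {
    if (centroids.isEmpty()) {
      centroids.add(point.clone());
      weights.add(weight);
      return;
    }
    // find the nearest existing centroid
    int nearest = 0;
    double d = Double.POSITIVE_INFINITY;
    for (int i = 0; i < centroids.size(); i++) {
      double di = euclidean(point, centroids.get(i));
      if (di < d) { d = di; nearest = i; }
    }
    // the distanceCutoff test: distant points tend to open new clusters
    if (rand.nextDouble() < d / distanceCutoff) {
      centroids.add(point.clone());
      weights.add(weight);
    } else {
      merge(nearest, point, weight);
    }
    // the dynamic part: too many clusters => grow the cutoff and the budget,
    // then recluster the centroids we already have
    if (centroids.size() > maxClusters) {
      distanceCutoff *= 1.5;           // growth factor is a guess
      maxClusters = 2 * maxClusters;   // likewise
      recluster();
    }
  }

  private void merge(int i, double[] point, double weight) {
    // weighted average of the centroid and the new point
    double[] c = centroids.get(i);
    double w = weights.get(i);
    for (int j = 0; j < c.length; j++) {
      c[j] = (w * c[j] + weight * point[j]) / (w + weight);
    }
    weights.set(i, w + weight);
  }

  private void recluster() {
    // feed the current centroids back through the streaming step
    // with the relaxed cutoff
    List<double[]> oldCentroids = centroids;
    List<Double> oldWeights = weights;
    centroids = new ArrayList<double[]>();
    weights = new ArrayList<Double>();
    for (int i = 0; i < oldCentroids.size(); i++) {
      add(oldCentroids.get(i), oldWeights.get(i));
    }
  }

  private static double euclidean(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return Math.sqrt(sum);
  }
}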

What is the rationale behind choosing f = estimateCutoff() instead of the
f = 1/(k(1 + log n)) of the original algorithm?

On the other hand, estimateCutoff() could be biased toward the minimum
distance among the first 100 points in the stream, or perhaps the idea is
to also update this value online...
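
To make the last point concrete, this is roughly the comparison I have in
mind between the two cutoffs, plus the kind of online update I was wondering
about (only an illustration; estimatedCutoff and updatedCutoff are
hypothetical names, not the actual estimateCutoff implementation):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class CutoffSketch {
  // the original algorithm's initial facility cost: f = 1 / (k * (1 + log n))
  static double theoreticalCutoff(long n, int k) {
    return 1.0 / (k * (1 + Math.log(n)));
  }

  // a data-driven estimate from the first points of the stream, e.g. a low
  // quantile of their pairwise distances; this is where the bias could come
  // from if the prefix is not representative
  static double estimatedCutoff(List<double[]> firstPoints) {
    List<Double> distances = new ArrayList<Double>();
    for (int i = 0; i < firstPoints.size(); i++) {
      for (int j = i + 1; j < firstPoints.size(); j++) {
        distances.add(euclidean(firstPoints.get(i), firstPoints.get(j)));
      }
    }
    if (distances.isEmpty()) {
      throw new IllegalArgumentException("need at least two points");
    }
    Collections.sort(distances);
    return distances.get(distances.size() / 100);  // ~1st percentile, a guess
  }

  // the kind of online update I mean: blend the current cutoff with the
  // distance of each newly seen point to its nearest centroid
  static double updatedCutoff(double cutoff, double nearestDistance) {
    double alpha = 0.01;  // learning rate, arbitrary
    return (1 - alpha) * cutoff + alpha * nearestDistance;
  }

  private static double euclidean(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }
}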

2012/4/5 Ted Dunning <ted.dunning@gmail.com>:
> I have some new clustering code that I have been working on.  It will probably
> be targeted back at Mahout at some point, but for reasons of agility, I
> have been running it out of github.
>
> The salient point is that there are essentially no knobs that need turning
> other than specifying a distance measure and possibly a large minimum
> number of clusters.  The output is a clustering that can be searched
> efficiently.
>
> The key point is that this algorithm is
>
> - single pass
>
> - easily map-reducible
>
> - fast
>
> The third point is a salient one.  On my laptop running in single threaded
> mode, this code is able to cluster 1,000,000 points in 20 dimensions into
> 1000 clusters in about a minute.
>
> See the StreamingKmeans class at https://github.com/tdunning/knn for more
> info.  The algorithm is based loosely on
>
> http://web.engr.oregonstate.edu/~shindler/papers/FastKMeans_nips11.pdf
>
>
> This code does not yet use the Mahout clustering API conventions, but is
> based entirely on the Mahout math package.
>
> Kibitzers welcome.
