mahout-user mailing list archives

From "Whitmore, Mattie" <mwhit...@harris.com>
Subject RE: Mahout-279/kmeans++
Date Fri, 17 Aug 2012 14:36:12 GMT
Hi Ted,

Yes, this is great!  I hope to start working with this algorithm in the next couple of weeks.

I have a question about the 0.7 implementation of kmeans and the clusterClassificationThreshold.
I have this value set to zero, but the output still shows that about 1/3 of my data
is not assigned to a cluster.  Am I using this value incorrectly?  I ran KMeansDriver.run
with both the 0.5 and 0.7 APIs, and the data was pruned despite clusterClassificationThreshold
= 0.


Thanks,

Mattie


-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Wednesday, August 15, 2012 5:20 PM
To: user@mahout.apache.org
Subject: Re: Mahout-279/kmeans++

Mattie,

Would this help?

https://github.com/tdunning/knn/blob/master/src/main/java/org/apache/mahout/knn/cluster/BallKmeans.java

and

https://github.com/tdunning/knn/blob/master/docs/scaling-k-means/scaling-k-means.pdf

On Wed, Aug 15, 2012 at 10:45 AM, Whitmore, Mattie <mwhitmor@harris.com> wrote:

> Hi!
>
> I have been using RandomSeedGenerator, and was hoping it had a patch like
> that described in Mahout-279 since I want only 10 vectors out of a set of
> more than 100,000,000.  I have been using canopy clustering for better
> results, but still need to do a few passes of kmeans to determine my T, and
> the random seed does take a long time.
>
> The comments say that you are working on a kmeans++. I searched around but
> couldn't find any more information about it.  Is a scalable kmeans++ in
> the works? (I know research on the subject is quite new.)
>
> Thanks!
>
>
>
> Mattie Whitmore
> Mathematician/IR&D Software Engineer
> HARRIS  Corporation - Advanced Information Solutions
> 301.837.5278
> mwhitmor@harris.com
>
>
>
>