commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gilles <gil...@harfang.homelinux.org>
Subject Re: [Math] kmeans++: decouple EM LLoyd's iterations and initial seeding of clustering centers.
Date Wed, 01 Jun 2016 14:46:02 GMT
On Wed, 1 Jun 2016 17:24:47 +0300, Artem Barger wrote:
> ​
> On Tue, May 31, 2016 at 4:04 PM, Artem Barger <artem@bargr.net> 
> wrote:
>
>> Hi,
>>
>> Current implementation of kmeans within CM framework, inherently 
>> uses
>> algorithm published by  Arthur, David, and Sergei Vassilvitskii.
>> "k-means++: The advantages of careful seeding." *Proceedings of the
>> eighteenth annual ACM-SIAM symposium on Discrete algorithms*. 
>> Society for
>> Industrial and Applied Mathematics, 2007. While there other 
>> alternative
>> algorithms for initial seeding is available, for instance:
>>
>> 1. Random initialization (each center picked uniformly at random).
>> 2. Canopy https://en.wikipedia.org/wiki/Canopy_clustering_algorithm
>> 3. Bicriteria  Feldman, Dan, et al. "Bi-criteria linear-time
>> approximations for generalized k-mean/median/center." *Proceedings 
>> of the
>> twenty-third annual symposium on Computational geometry*. ACM, 2007.
>>
>> While I understand that kmeans++ is preferable option, others could 
>> be
>> also used for testing, trials and evaluations as well.
>>
>> I'd like to propose to separate logic of seeding and clustering to
>> increase flexibility for kmeans clustering. Would be glad to hear 
>> your
>> comments, pros/cons or rejections...
>>
>>
> I've found "Scalable KMeans" or kmeans|| as referred in the
> http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf, which 
> provides
> parallelizable seeding procedure.
> ​I guess this might serve as additional +1 vote for doing separation
> between seeding and LLoyd's iterations in current implementations of 
> kmeans.

I guess that, around here, you are the expert about these algorithms...
So go ahead and (re)write the code as you see fit, while still taking 
into
account that the code should be self-documenting as much as possible.
And OO (since this is Java).

If you are up for a major refactoring (e.g. for sparse data), I'd 
suggest
to do it in a new package, so that we can easily compare the old and 
new
codes (e.g. run the tests).

And if you contemplate parallelization, I wonder whether the issue of
switching to Java 8 might not have to be resolved first.


Regards,
Gilles


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Mime
View raw message