commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gilles Sadowski <>
Subject Re: Re: [math] Discuss: New feature MiniBatchKMeansClusterer
Date Wed, 25 Mar 2020 15:14:42 GMT

> > [...]
> >>>>
> >>>>     I have started 2 PRs to solve the problem you metioned.
> >>>>
> >>>>     About the "CentroidInitializer" I have a new idea:
> >>>> Move CentroidInitializers as inner classes of "KMeansPlusPlusCluster",
> >>>> and add a construct parameter and a property "useKMeansPlusPlus" to
> >>>> "KMeansPlusPlusCluster":
> >>>> ```java
> >>>> // Add "useKMeansPlusPlus" to "KMeansPlusPlusClusterer"
> >>>> public class KMeansPlusPlusClusterer<T extends Clusterable> extends
> >>>> Clusterer<T> {
> >>>>     public KMeansPlusPlusClusterer(final int k, final int maxIterations,
> >>>>                                final DistanceMeasure measure,
> >>>>                                final UniformRandomProvider random,
> >>>>                                final EmptyClusterStrategy emptyStrategy,
> >>>> +                             final useKMeansPlusPlus) {
> >>>>     // ...
> >>>> -  // Use K-means++ to choose the initial centers.
> >>>> -  this.centroidInitializer = new
> >>>> KMeansPlusPlusCentroidInitializer(measure, random);
> >>>> +  this.useKMeansPlusPlus = useKMeansPlusPlus;
> >>>> }
> >>>
> >>>What if one comes up with a third way to initialize the centroids?
> >>>If you can ensure that there is no other initialization procedure,
> >>>a boolean is fine, if not, we could still make the existing procedures
> >>>package-private (e.g. moving them in as classes defined within
> >>>"KMeansPlusPlusClusterer".
> >>
> >> As I know the k-means has two center initialize methods, random and
> >> k-means++ so far,
> >> use a boolean to choose which method to use is good enough for current use,
> >
> >Indeed but the question is: Will it be future-proof?
> >[We don't want to break compatibility (and have to release a
> >new major version of the library) for having overlooked the
> >question which I'm asking.]
> >
> >> but there are two situations use need to implement the center initialize
> >> method themselves:
> >> 1. The Commoans Maths's implements is not good enough;
> >
> >As this is FLOSS, the understanding (IMO) is in that case users
> >would contribute back to fix what needs be.
> >
> >> 2. There are new center initialize methods.
> >
> >So, that would be arguing against using a boolean (?).
> >Cf. above (about breaking compatibility).
> The name "KMeansPlusPlusClusterer" decide it won't support new initialize methods.
> Whether "KMeansPlusPlusClusterer" compatible random

Sorry, but I don't understand what you mean.

> >>>
> >>>Also, in the current implementation of "KMeansPlusPlusClusterer", the
> >>>initialization is not configurable ("KMeansPlusPlusCentroidInitializer").
> >>>Perhaps we don't want to depart from the original (?) algorithm; if so,
> >>>the new constructor could be made protected (thus simplifying the API).
> >>
> >> k-means++ is the recommend center initialize method for now days,
> >> show we let user to fall back to random choose centers, that is a question
> >> need to tradeoff.
> >
> >Is there any advantage to "random" init vs "kmeans" init?
> >E.g. is "random" faster, yet would give similar clustering results?
> k-means++ is an improvement of k-means(random init).

Do I understand correctly that to call the algorithm KMeans++,
the only change (from "Kmeans") is the centroid initialization
(using "KMeans" vs using "random")?
If so, it would be a contradiction (name vs functionality) to allow
"random" in a class called "KMeansPlusPlusClusterer".

So it seems the alternatives are:
 (1) Drop "random" init (and keep "KMeansPlusPlusClusterer") if
you are positive that this functionality won"t be missed.
 (2) Allow flexibility for init (and rename "KMeansPlusPlusClusterer"
to "KMeansClusterer"), adding a boolean argument to the constructor:
"doKMeansPlusPlus" to select "random" (false) or "kmeans" (true).
 (3) Allow flexibility by passing a function to perform the init.

According to what you wrote, (1) is the better alternative right
now. Option (3) might (perhaps!) become a nice enhancement,
but only after we settle on a better design for the fundamental
classes ("Cluster", "ClusterSet", "ClusterResult", ...).

> The performance loss by using "k-means++" is tinily relative to entire clustering process.
> I think keep old state is OK(make chooseInitialCenters package-private),
> but now "CentroidInitializer" is a middle state.

+1 if you mean that you'll remove, for now, the package "initialization".
Please submit a PR.

> >
> >> Show we make the API simple or rich?
> >
> >I'd keep it simple until we fix the (IMHO) more important
> >issues of thread-safety and sparse data.
> Some Personal opinion about "Sparse data":
> Math ML do not use a martix for clustering, but list of vectors(double arrays),
>  this is flexible and simple (This is the main attraction of Math ML to me).
> Operations on sparse martixs can be faster than common martixs,
> but not that faster on sparse vectors, but the performance improvement will bring complexity.
> Anyhow if we want to start this optimization,  I think there are 2 ways:
> 1. Define a abstract Martix, and make Cluser API work on Martix, this is a big change.
> 2. Defina a abstract Vector, and replace double[] with Vector, implement a series of
operations on Vector, include Sparse Vector.

Unless I'm missing some point, this is going astray; my mentioning
of "sparse data" is related to issue MATH-1330.[1]  IIUC the concern
there was that it should be possible to process the data without
forcing the instantiation of a "double[]".



> > [...]

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message