Hello.
> > [...]
> >>>>
> >>>> I have started 2 PRs to solve the problem you metioned.
> >>>>
> >>>> About the "CentroidInitializer" I have a new idea:
> >>>> Move CentroidInitializers as inner classes of "KMeansPlusPlusCluster",
> >>>> and add a construct parameter and a property "useKMeansPlusPlus" to
> >>>> "KMeansPlusPlusCluster":
> >>>> ```java
> >>>> // Add "useKMeansPlusPlus" to "KMeansPlusPlusClusterer"
> >>>> public class KMeansPlusPlusClusterer<T extends Clusterable> extends
> >>>> Clusterer<T> {
> >>>> public KMeansPlusPlusClusterer(final int k, final int maxIterations,
> >>>> final DistanceMeasure measure,
> >>>> final UniformRandomProvider random,
> >>>> final EmptyClusterStrategy emptyStrategy,
> >>>> + final useKMeansPlusPlus) {
> >>>> // ...
> >>>>  // Use Kmeans++ to choose the initial centers.
> >>>>  this.centroidInitializer = new
> >>>> KMeansPlusPlusCentroidInitializer(measure, random);
> >>>> + this.useKMeansPlusPlus = useKMeansPlusPlus;
> >>>> }
> >>>
> >>>What if one comes up with a third way to initialize the centroids?
> >>>If you can ensure that there is no other initialization procedure,
> >>>a boolean is fine, if not, we could still make the existing procedures
> >>>packageprivate (e.g. moving them in as classes defined within
> >>>"KMeansPlusPlusClusterer".
> >>
> >> As I know the kmeans has two center initialize methods, random and
> >> kmeans++ so far,
> >> use a boolean to choose which method to use is good enough for current use,
> >
> >Indeed but the question is: Will it be futureproof?
> >[We don't want to break compatibility (and have to release a
> >new major version of the library) for having overlooked the
> >question which I'm asking.]
> >
> >> but there are two situations use need to implement the center initialize
> >> method themselves:
> >> 1. The Commoans Maths's implements is not good enough;
> >
> >As this is FLOSS, the understanding (IMO) is in that case users
> >would contribute back to fix what needs be.
> >
> >> 2. There are new center initialize methods.
> >
> >So, that would be arguing against using a boolean (?).
> >Cf. above (about breaking compatibility).
> The name "KMeansPlusPlusClusterer" decide it won't support new initialize methods.
> Whether "KMeansPlusPlusClusterer" compatible random
Sorry, but I don't understand what you mean.
> >>>
> >>>Also, in the current implementation of "KMeansPlusPlusClusterer", the
> >>>initialization is not configurable ("KMeansPlusPlusCentroidInitializer").
> >>>Perhaps we don't want to depart from the original (?) algorithm; if so,
> >>>the new constructor could be made protected (thus simplifying the API).
> >>
> >> kmeans++ is the recommend center initialize method for now days,
> >> show we let user to fall back to random choose centers, that is a question
> >> need to tradeoff.
> >
> >Is there any advantage to "random" init vs "kmeans" init?
> >E.g. is "random" faster, yet would give similar clustering results?
>
> kmeans++ is an improvement of kmeans(random init).
Do I understand correctly that to call the algorithm KMeans++,
the only change (from "Kmeans") is the centroid initialization
(using "KMeans" vs using "random")?
If so, it would be a contradiction (name vs functionality) to allow
"random" in a class called "KMeansPlusPlusClusterer".
So it seems the alternatives are:
(1) Drop "random" init (and keep "KMeansPlusPlusClusterer") if
you are positive that this functionality won"t be missed.
(2) Allow flexibility for init (and rename "KMeansPlusPlusClusterer"
to "KMeansClusterer"), adding a boolean argument to the constructor:
"doKMeansPlusPlus" to select "random" (false) or "kmeans" (true).
(3) Allow flexibility by passing a function to perform the init.
According to what you wrote, (1) is the better alternative right
now. Option (3) might (perhaps!) become a nice enhancement,
but only after we settle on a better design for the fundamental
classes ("Cluster", "ClusterSet", "ClusterResult", ...).
> The performance loss by using "kmeans++" is tinily relative to entire clustering process.
> I think keep old state is OK(make chooseInitialCenters packageprivate),
> but now "CentroidInitializer" is a middle state.
+1 if you mean that you'll remove, for now, the package "initialization".
Please submit a PR.
> >
> >> Show we make the API simple or rich?
> >
> >I'd keep it simple until we fix the (IMHO) more important
> >issues of threadsafety and sparse data.
>
> Some Personal opinion about "Sparse data":
> Math ML do not use a martix for clustering, but list of vectors(double arrays),
> this is flexible and simple (This is the main attraction of Math ML to me).
>
> Operations on sparse martixs can be faster than common martixs,
> but not that faster on sparse vectors, but the performance improvement will bring complexity.
>
> Anyhow if we want to start this optimization, I think there are 2 ways:
> 1. Define a abstract Martix, and make Cluser API work on Martix, this is a big change.
> 2. Defina a abstract Vector, and replace double[] with Vector, implement a series of
operations on Vector, include Sparse Vector.
Unless I'm missing some point, this is going astray; my mentioning
of "sparse data" is related to issue MATH1330.[1] IIUC the concern
there was that it should be possible to process the data without
forcing the instantiation of a "double[]".
Best,
Gilles
[1] https://issues.apache.org/jira/browse/MATH1330
> > [...]

To unsubscribe, email: devunsubscribe@commons.apache.org
For additional commands, email: devhelp@commons.apache.org
