commons-dev mailing list archives

From Artem Barger
Subject Re: [math]: [MATH-1330] - KMeans clustering algorithm, doesn't support clustering of sparse input data.
Date Tue, 31 May 2016 12:53:42 GMT

I'm working on a solution for MATH-1330 and have run into several design
issues that I'd like to share, since I want my solution to fit the
project's road map and keep its design consistent. Looking at the
Clusterable interface, it seems to dictate how the data must be represented
internally: the signature of getPoint() implicitly assumes that the data
is an array of doubles, and this might not hold in certain cases. IMO
replacing getPoint() with getDistanceTo(Clusterable a) could be a better
solution, since it doesn't assume anything about the internal
representation. On the other hand, it means that Clusterable instances
need to be aware of which DistanceMeasure implementation is used for
clustering.
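
To make the trade-off concrete, here is a minimal sketch of the
getDistanceTo() idea. The names DistanceClusterable and SparsePoint are
hypothetical and are not part of Commons Math; the sketch also assumes
Euclidean distance, hard-wired into the point, which is exactly the coupling
to the distance measure described above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical alternative to Clusterable: the point computes distances itself. */
interface DistanceClusterable<T> {
    double getDistanceTo(T other);
}

/** Sparse point that stores only its non-zero coordinates (index -> value). */
final class SparsePoint implements DistanceClusterable<SparsePoint> {
    private final Map<Integer, Double> entries = new HashMap<>();

    /** Sets one coordinate; zeros are not stored. Returns this for chaining. */
    SparsePoint set(int index, double value) {
        if (value != 0.0) {
            entries.put(index, value);
        } else {
            entries.remove(index);
        }
        return this;
    }

    /**
     * Euclidean distance computed over stored entries only, so no dense
     * double[] is ever materialized. The metric is fixed inside the point
     * (an assumption of this sketch).
     */
    @Override
    public double getDistanceTo(SparsePoint other) {
        double sum = 0.0;
        // Coordinates present in this point (the other side may be zero).
        for (Map.Entry<Integer, Double> e : entries.entrySet()) {
            double d = e.getValue() - other.entries.getOrDefault(e.getKey(), 0.0);
            sum += d * d;
        }
        // Coordinates present only in the other point.
        for (Map.Entry<Integer, Double> e : other.entries.entrySet()) {
            if (!entries.containsKey(e.getKey())) {
                sum += e.getValue() * e.getValue();
            }
        }
        return Math.sqrt(sum);
    }
}
```

Two points with a single non-zero entry each, even at very large indices,
are compared in time proportional to the number of stored entries rather
than to the dimension.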

Therefore I'm not completely sure how to proceed. Moreover, suppose I
change getPoint() to return a RealVector; the next issue is then to decide
how I should define/create the cluster centers. When should I use a sparse
implementation, and when a dense one?

One possible solution I'm thinking of is to decouple the seeding of the
initial cluster centers from the Lloyd's iterations. That way I can seed
the initial centers and pass them as a parameter to the clustering
algorithm, which will move the centers during the iterations instead of
creating a new centroid instance each time. While this will work for
center-based clustering algorithms, it will not work for others, hence I'm
not sure how to fit this solution into the current design.
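
A rough sketch of that decoupling, under simplifying assumptions: points
are dense double[] arrays, the distance is squared Euclidean, and the
seeding step just copies the first k points. LloydSketch, seedFirstK and
iterate are hypothetical names, not existing Commons Math API.

```java
import java.util.Arrays;

final class LloydSketch {

    /** Hypothetical seeding step, kept separate from the iterations. */
    static double[][] seedFirstK(double[][] points, int k) {
        double[][] centers = new double[k][];
        for (int i = 0; i < k; i++) {
            centers[i] = points[i].clone();
        }
        return centers;
    }

    /**
     * Lloyd's iterations over caller-supplied centers. The center arrays
     * are overwritten in place; no new centroid objects are created.
     * Returns the index of the assigned center for each point.
     */
    static int[] iterate(double[][] points, double[][] centers, int maxIterations) {
        int dim = points[0].length;
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);
        double[][] sums = new double[centers.length][dim];
        int[] counts = new int[centers.length];

        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach each point to its nearest center.
            boolean changed = false;
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                double bestDist = Double.POSITIVE_INFINITY;
                for (int c = 0; c < centers.length; c++) {
                    double d = squaredDistance(points[p], centers[c]);
                    if (d < bestDist) {
                        bestDist = d;
                        best = c;
                    }
                }
                if (assignment[p] != best) {
                    assignment[p] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged
            }
            // Update step: move each center to the mean of its points.
            Arrays.fill(counts, 0);
            for (double[] s : sums) {
                Arrays.fill(s, 0.0);
            }
            for (int p = 0; p < points.length; p++) {
                int c = assignment[p];
                counts[c]++;
                for (int j = 0; j < dim; j++) {
                    sums[c][j] += points[p][j];
                }
            }
            for (int c = 0; c < centers.length; c++) {
                if (counts[c] == 0) {
                    continue; // empty cluster: leave the center where it is
                }
                for (int j = 0; j < dim; j++) {
                    centers[c][j] = sums[c][j] / counts[c];
                }
            }
        }
        return assignment;
    }

    private static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int j = 0; j < a.length; j++) {
            double d = a[j] - b[j];
            sum += d * d;
        }
        return sum;
    }
}
```

Because the caller creates the centers, the caller also decides their
representation; the iteration code only needs to be able to read a center
and move it.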

Any thoughts or suggestions?

