commons-dev mailing list archives

From Artem Barger <>
Subject [math]: [MATH-1330] - KMeans clustering algorithm, doesn't support clustering of sparse input data.
Date Mon, 25 Apr 2016 12:52:03 GMT
Hi All,

I'd like to provide a solution for the [MATH-1330] issue. Before starting, I
have some concerns regarding the possible design and the actual implementation.

Currently, all implementations of Clusterer expect to receive a
DistanceMeasure instance, which is used to compute the distance or metric
between two points. Switching the clustering algorithms to work with Vectors
would make this unnecessary: there would be no need to provide a
DistanceMeasure, since the Vector class already provides methods to compute
vector norms.

The main drawback of this approach is that we would lose the ability to
control which metric is used during clustering. However, the only classes
that make implicit use of this parameter are Clusterer and
KMeansPlusPlusClusterer; all the others assume EuclideanDistance by default.

One possible approach is to extend the DistanceMeasure interface so that it
can compute the distance between two vectors. After all, it only requires
subtracting the first vector from the second and computing the desired norm
of the result.
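
A minimal sketch of that idea, assuming a vector type that can subtract
and report its own norms. SparseVec, VectorDistanceMeasure and the two
implementations are hypothetical stand-ins written for illustration, not
existing Commons Math classes; in the library the role of SparseVec would
presumably be played by a RealVector implementation.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical minimal sparse vector: only non-zero entries are stored. */
final class SparseVec {
    private final Map<Integer, Double> entries = new HashMap<>();

    /** Sets one coordinate; zeros are never stored. Returns this for chaining. */
    SparseVec set(int index, double value) {
        if (value == 0.0) {
            entries.remove(index);
        } else {
            entries.put(index, value);
        }
        return this;
    }

    /** Returns (this - other), visiting only the stored entries of both vectors. */
    SparseVec subtract(SparseVec other) {
        SparseVec result = new SparseVec();
        result.entries.putAll(entries);
        for (Map.Entry<Integer, Double> e : other.entries.entrySet()) {
            double diff = result.entries.getOrDefault(e.getKey(), 0.0) - e.getValue();
            result.set(e.getKey(), diff);
        }
        return result;
    }

    /** Euclidean (L2) norm. */
    double getNorm() {
        double sum = 0.0;
        for (double v : entries.values()) {
            sum += v * v;
        }
        return Math.sqrt(sum);
    }

    /** Manhattan (L1) norm. */
    double getL1Norm() {
        double sum = 0.0;
        for (double v : entries.values()) {
            sum += Math.abs(v);
        }
        return sum;
    }
}

/** Hypothetical vector-based counterpart of DistanceMeasure. */
interface VectorDistanceMeasure {
    double compute(SparseVec a, SparseVec b);
}

/** Distance = L2 norm of the difference. */
final class EuclideanVectorDistance implements VectorDistanceMeasure {
    @Override
    public double compute(SparseVec a, SparseVec b) {
        return a.subtract(b).getNorm();
    }
}

/** Distance = L1 norm of the difference. */
final class ManhattanVectorDistance implements VectorDistanceMeasure {
    @Override
    public double compute(SparseVec a, SparseVec b) {
        return a.subtract(b).getL1Norm();
    }
}
```

With this shape the clusterer keeps a pluggable metric, and the cost of one
distance computation depends on the number of stored entries rather than on
the nominal dimension.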

Another possible solution is to make the vector return its coordinates,
which would allow us to use DistanceMeasure as is. Personally, I do not think
this is a good approach, since it makes no sense with sparse vectors.
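
To illustrate the concern, here is a small sketch assuming a sparse point
that must hand out a dense double[] so that an array-based measure can be
reused unchanged. SparsePoint and DenseEuclidean are hypothetical names
used only for this example.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sparse point with a large nominal dimension. */
final class SparsePoint {
    private final int dimension;
    private final Map<Integer, Double> nonZeros = new TreeMap<>();

    SparsePoint(int dimension) {
        this.dimension = dimension;
    }

    /** Stores one non-zero coordinate. Returns this for chaining. */
    SparsePoint set(int index, double value) {
        nonZeros.put(index, value);
        return this;
    }

    /** Number of coordinates actually held in memory. */
    int storedEntries() {
        return nonZeros.size();
    }

    /** Densifies the point: allocates one double per dimension. */
    double[] toArray() {
        double[] dense = new double[dimension];
        for (Map.Entry<Integer, Double> e : nonZeros.entrySet()) {
            dense[e.getKey()] = e.getValue();
        }
        return dense;
    }
}

/** Array-based Euclidean distance, in the style of compute(double[], double[]). */
final class DenseEuclidean {
    static double compute(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }
}
```

A point with two stored entries and a dimension of one million yields a
one-million-element array on every toArray() call, and the distance loop
visits every element, so the benefit of the sparse representation is lost.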

The last alternative that comes to my mind is to create a set of enums to
indicate which vector norm to use to compute distances. I also do not think
this is a very good solution, since it sounds too intrusive and might break
backward compatibility.
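
For comparison, the enum alternative could look roughly like the sketch
below. VectorNorm and its constants are hypothetical names; plain double[]
arrays are used only to keep the example self-contained, and a sparse
version would iterate over stored entries instead.

```java
/** Hypothetical enum selecting the norm applied to the difference vector. */
enum VectorNorm {
    L1 {
        @Override
        double of(double[] v) {
            double sum = 0.0;
            for (double x : v) {
                sum += Math.abs(x);
            }
            return sum;
        }
    },
    L2 {
        @Override
        double of(double[] v) {
            double sum = 0.0;
            for (double x : v) {
                sum += x * x;
            }
            return Math.sqrt(sum);
        }
    },
    LINF {
        @Override
        double of(double[] v) {
            double max = 0.0;
            for (double x : v) {
                max = Math.max(max, Math.abs(x));
            }
            return max;
        }
    };

    /** Norm of a single vector. */
    abstract double of(double[] v);

    /** Distance between a and b: the selected norm of (a - b). */
    double distance(double[] a, double[] b) {
        double[] diff = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            diff[i] = a[i] - b[i];
        }
        return of(diff);
    }
}
```

The set of metrics is then fixed by the enum constants, so a user-defined
metric can no longer be plugged in without changing the enum itself.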

What do you think? Am I missing something? Is there a better possible way
to achieve the goal?

Best regards,
                      Artem Barger.
