Ok, this post is going to be a long one, and so it deserves its own
thread. My apologies beforehand.
Here's what I have tried in order to ease the distance calculation
problem. I know it's quite nasty, but I wanted to confirm that our
bottleneck was there, in the Euclidean distance computation.
But first, a little excursus:
/* EXCURSUS */
The "optimized" version for SparseVectors of distance() in
SquaredEuclideanDistanceMeasure.java, currently, does the following:
  if (centroid.size() != v.size()) {
    throw new CardinalityException();
  }
  return centroidLengthSquare + v.getDistanceSquared(centroid);
It takes the centroid's squared length as a parameter and adds it to
what getDistanceSquared() returns. That method, however, computes the
following inside a while loop (for the comments, assume two 2D vectors
[X0, X1] and [Y0, Y1]):
  elt = iter.next();
  centroidValue = v.getQuick(elt.index());
  delta = elt.get() - centroidValue;
  result += (delta * delta) - (centroidValue * centroidValue);
  // This is (X0 - Y0)*(X0 - Y0) - Y0*Y0 + (X1 - Y1)*(X1 - Y1) - Y1*Y1;
  // I don't have the slightest idea what to call this value.
I certainly couldn't understand the reason behind that (centroidValue
* centroidValue) subtraction. Being called getDistanceSquared(), the
method should return just that... and that is the value one gets by
simply ignoring the subtraction, that is:
  ...
  result += (delta * delta); // This is (X0 - Y0)*(X0 - Y0) + (X1 - Y1)*(X1 - Y1); the ACTUAL squared distance.
  ...
Moreover, the sum of every (centroidValue * centroidValue) across the
iterations (Y0*Y0 + Y1*Y1, in the example) corresponds to
centroidLengthSquare, which was calculated before calling
distance(centroidLengthSquare, centroid, v) and then added to the
result. So, to sum up: first, a certain length is calculated (the
squared norm of what the signature of distance() calls "centroid");
second, that SAME value gets calculated again in another method and
subtracted from a certain total; last, the two values cancel each
other out when both totals are summed, leaving just the wanted squared
distance, here:
return centroidLengthSquare + v.getDistanceSquared(centroid);
Which could be rewritten as:
  return centroidLengthSquare + squaredDistanceBetweenCentroidAndV - centroidLengthSquare;
Maybe this has been done on purpose, though that purpose eludes me for now.
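To convince myself the two terms really do cancel, here's a tiny
standalone check (class and method names are mine, and plain double[]
arrays stand in for Mahout Vectors) comparing the roundabout
computation against the plain squared distance:

```java
// Standalone check that the roundabout computation
//   centroidLengthSquare + sum((Xi - Yi)^2 - Yi^2)
// equals the plain squared Euclidean distance sum((Xi - Yi)^2).
// Class and method names are hypothetical; double[] stands in for Vector.
public class CancellationCheck {

    // Mimics SparseVector.getDistanceSquared() as currently written:
    // accumulates (delta * delta) - (centroidValue * centroidValue).
    static double distanceSquaredWithSubtraction(double[] x, double[] y) {
        double result = 0.0;
        for (int i = 0; i < x.length; i++) {
            double delta = x[i] - y[i];
            result += (delta * delta) - (y[i] * y[i]);
        }
        return result;
    }

    // The plain squared Euclidean distance.
    static double distanceSquared(double[] x, double[] y) {
        double result = 0.0;
        for (int i = 0; i < x.length; i++) {
            double delta = x[i] - y[i];
            result += delta * delta;
        }
        return result;
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0};
        double[] y = {3.0, 5.0};
        double centroidLengthSquare = y[0] * y[0] + y[1] * y[1]; // |y|^2 = 34
        double viaDetour = centroidLengthSquare + distanceSquaredWithSubtraction(x, y);
        double direct = distanceSquared(x, y);
        System.out.println(viaDetour + " == " + direct); // both print 13.0
    }
}
```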
/* END EXCURSUS */
And now, the fun part: code disrupting. Here are the changes I applied
(remember: just for testing performance's sake). I left the changed
bits commented.
EuclideanDistanceMeasure.java
Renamed distance(double centroidLengthSquare, Vector centroid, Vector
v) to sparseDistance() and eliminated the first param. We don't need
the sqrt to compare distances when emitting points to clusters (though
we do need it to compute a cluster's convergence), so I took that out
as well (I know the whole point of this class is to sqrt the results
from SquaredEuclideanDistanceMeasure, but I needed the distance
function that's compromising performance to do a minimal set of
calculations).
  @Override
  public double sparseDistance(/*double centroidLengthSquare,*/ Vector centroid, Vector v) {
    return /*Math.sqrt(*/ super.sparseDistance(/*centroidLengthSquare,*/ centroid, v)/*)*/;
  }
SquaredEuclideanDistanceMeasure.java
Rewrote and renamed the implementation of distance(double
centroidLengthSquare, Vector centroid, Vector v). Ignored the size
check (again: just for the benefit of performance). It now just
returns the result of getDistanceSquared().
  @Override
  public double sparseDistance(/*double centroidLengthSquare,*/ Vector centroid, Vector v) {
    /*
    if (centroid.size() != v.size()) {
      throw new CardinalityException();
    }
    double result = centroidLengthSquare;
    result += v.getDistanceSquared(centroid);
    */
    return /*centroidLengthSquare +*/ v.getDistanceSquared(centroid);
  }
SparseVector.java
Refactored getDistanceSquared(). Variables need not be created inside
a loop if there's no reason to do so, and the (centroidValue *
centroidValue) term is no longer computed in each iteration.
  @Override
  public double getDistanceSquared(Vector v) {
    // TODO: check sizes?
    double result = 0.0;
    Iterator<Vector.Element> iter = iterateNonZero();
    Vector.Element elt;
    double delta;
    while (iter.hasNext()) {
      elt = iter.next();
      delta = elt.get() - v.getQuick(elt.index());
      result += (delta * delta); // no more (centroidValue * centroidValue)
    }
    return result;
  }
Cluster.java
Refactored emitPointToNearestCluster(). The null comparison was
eliminated (no longer necessary), and sparseDistance() is called
instead.
  ...
  Cluster nearestCluster = null;
  double nearestDistance = Double.MAX_VALUE;
  double distance = 0;
  for (Cluster cluster : clusters) {
    distance = measure.sparseDistance(cluster.getCenter(), point);
    if (distance < nearestDistance) {
      nearestCluster = cluster;
      nearestDistance = distance;
    }
  }
  ...
Added a Math.sqrt call to computeConvergence(), which now uses sparseDistance().
  public boolean computeConvergence() {
    Vector centroid = computeCentroid();
    converged = Math.sqrt(measure.sparseDistance(/*centroid.getLengthSquared(),*/ centroid, center)) <= convergenceDelta;
    return converged;
  }
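Incidentally, even this remaining sqrt could probably be avoided by
comparing the squared distance against convergenceDelta squared: both
sides are non-negative, so the two tests are equivalent. A hedged
sketch (hypothetical class, not actual Mahout code):

```java
// Sketch: the remaining sqrt in computeConvergence() could be avoided by
// squaring the threshold instead. Since both sides are non-negative,
//   sqrt(d2) <= delta  <=>  d2 <= delta * delta.
// Hypothetical class; not actual Mahout code.
public class ConvergenceCheck {

    static boolean convergedWithSqrt(double squaredDistance, double delta) {
        return Math.sqrt(squaredDistance) <= delta;
    }

    static boolean convergedWithoutSqrt(double squaredDistance, double delta) {
        return squaredDistance <= delta * delta;
    }

    public static void main(String[] args) {
        double delta = 0.5;
        for (double d2 : new double[]{0.0, 0.2, 0.25, 0.3, 4.0}) {
            // The two formulations always agree.
            System.out.println(d2 + ": " + (convergedWithSqrt(d2, delta) == convergedWithoutSqrt(d2, delta)));
        }
    }
}
```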
After all those modifications, kMeans now runs noticeably faster.
Running the whole thing locally on a pseudo-distributed cluster, every
iteration takes about 7 minutes to complete using the very same
dataset I uploaded and posted some days ago, where each iteration used
to take 3 hours. Again, this is not meant to be a final, closed
solution at all. But I think it could very well serve as a first step
towards one.
