mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Distance calculation performance issue
Date Tue, 28 Jul 2009 20:29:17 GMT
Funny, I did this same thing on the plane by revisiting Shashi's patch  
on MAHOUT-121 and properly applying it to K-Means (I had missed one  
critical line that needs to use the centroid-based distance method  
instead of the (Vector, Vector) one).  I think your data set ran, for  
10 iterations, in just over 2 minutes, and that was with the profiler  
hooked up, too.
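
For reference, the two method shapes in question look roughly like this  
(just a sketch of the relevant signatures, not the exact MAHOUT-121 code):

   // centroid-aware form: the caller passes in the centroid's
   // precomputed squared length
   double distance(double centroidLengthSquare, Vector centroid, Vector v);

   // plain pairwise form
   double distance(Vector v1, Vector v2);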

I will post a patch on MAHOUT-121, can you take a look?  Also, if you  
can post your changes as a patch, then we can likely compare/merge,  
etc. and come up with the best way of doing all of this.

Thanks for the insight/analysis.

-Grant


On Jul 28, 2009, at 10:16 AM, nfantone wrote:

> Ok, this post is going to be a long one, and so it deserves its own
> thread. My apologies beforehand.
>
> Here's what I tried in order to ease the distance calculation problem. I
> know it's quite nasty, but I wanted to make sure our bottleneck really
> was there, in the Euclidean distance computation.
>
> But first, a little excursus:
>
> /* EXCURSUS */
> The "optimized", SparseVector-oriented version of distance() in
> SquaredEuclideanDistanceMeasure.java currently does the following:
>
>    if (centroid.size() != v.size()) {
>      throw new CardinalityException();
>    }
>
>    double result = centroidLengthSquare;
>    result += v.getDistanceSquared(centroid);
>    return centroidLengthSquare + v.getDistanceSquared(centroid);
>
> It expects the centroid's squared length, which is then added to
> whatever getDistanceSquared() returns. However, that method computes the
> following inside a while loop (for the comments, assume two 2D vectors
> [X0, X1] and [Y0, Y1]):
>
>      elt = iter.next();
>      centroidValue = v.getQuick(elt.index());
>      delta = elt.get() - centroidValue;
>      // This accumulates (X0 - Y0)*(X0 - Y0) - Y0*Y0 + (X1 - Y1)*(X1 - Y1) - Y1*Y1;
>      // I don't have the slightest idea what to call this value.
>      result += (delta * delta) - (centroidValue * centroidValue);
>
> I certainly couldn't understand the reason behind that (centroidValue
> * centroidValue) subtraction. A method called getDistanceSquared()
> should return just that... and that is exactly the value you get by
> ignoring the subtraction, that is:
>
> ...
>      // This is (X0 - Y0)*(X0 - Y0) + (X1 - Y1)*(X1 - Y1); the ACTUAL squared distance.
>      result += (delta * delta);
> ...
>
> Moreover, the sum of all the (centroidValue * centroidValue) terms
> across the iterations (Y0*Y0 + Y1*Y1 in the example) is exactly
> centroidLengthSquare, which was already calculated before calling
> distance(centroidLengthSquare, centroid, v) and is added to the result
> afterwards. So, to sum up: first, a certain length is computed (the
> squared norm of what the signature of distance() calls "centroid");
> second, that SAME value gets computed again in another method and
> subtracted from a running total; last, the two cancel each other out
> when the totals are summed, leaving just the squared distance we
> wanted, here:
>
> return centroidLengthSquare + v.getDistanceSquared(centroid);
>
> Which could be re-written as:
>
> return centroidLengthSquare + squaredDistanceBetweenCentroidAndV - centroidLengthSquare;
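>
> A quick numeric sanity check of that cancellation (a throwaway sketch
> with made-up 2D values, plain Java, not Mahout code):
>
>  public class CancellationCheck {
>    public static void main(String[] args) {
>      // point X = [1, 2], centroid Y = [3, 4]
>      double x0 = 1, x1 = 2, y0 = 3, y1 = 4;
>      double centroidLengthSquare = y0 * y0 + y1 * y1;                // 25
>      // what the while loop in getDistanceSquared() accumulates today
>      double loopTotal = (x0 - y0) * (x0 - y0) - y0 * y0
>                       + (x1 - y1) * (x1 - y1) - y1 * y1;             // 8 - 25 = -17
>      // what distance(centroidLengthSquare, centroid, v) then returns
>      double returned = centroidLengthSquare + loopTotal;             // 25 - 17 = 8
>      // the plain squared Euclidean distance between X and Y
>      double squared = (x0 - y0) * (x0 - y0) + (x1 - y1) * (x1 - y1); // 8
>      System.out.println(returned == squared);                        // true
>    }
>  }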
>
> Maybe this has been done on purpose, though that purpose eludes me  
> for now.
>
> /* END EXCURSUS */
>
> And now, the fun part: disrupting the code. Here are the changes I
> applied (remember: just for testing performance's sake). I left the
> changed bits commented out.
>
> EuclideanDistanceMeasure.java
> Renamed distance(double centroidLengthSquare, Vector centroid, Vector
> v) to sparseDistance() and eliminated the first param. We don't need
> the sqrt to compare distances when emitting points to clusters, since
> sqrt is monotonic and the nearest cluster comes out the same either way
> (though we do need it to compute a cluster's convergence), so I took
> that out as well. I know the whole point of this class is to sqrt the
> results from SquaredEuclideanDistanceMeasure, but I needed the distance
> function that's compromising performance to do a minimal set of
> calculations.
>
>  @Override
>  public double sparseDistance(/*double centroidLengthSquare,*/ Vector centroid, Vector v) {
>    return /*Math.sqrt(*/super.sparseDistance(/*centroidLengthSquare,*/ centroid, v);//*);
>  }
>
>
> SquaredEuclideanDistanceMeasure.java
> Rewrote and renamed the implementation of distance(double
> centroidLengthSquare, Vector centroid, Vector v). Skipped the size
> check (again: just for the benefit of performance) and simply return
> the result of getDistanceSquared().
>
> @Override
>  public double sparseDistance(/*double centroidLengthSquare,*/ Vector centroid, Vector v) {
> /*  if (centroid.size() != v.size()) {
>      throw new CardinalityException();
>    }
>
>    double result = centroidLengthSquare;
>    result += v.getDistanceSquared(centroid);
> */
>    return /*centroidLengthSquare +*/ v.getDistanceSquared(centroid);
>  }
>
> SparseVector.java
> Refactored getDistanceSquared(). Variables needn't be created inside
> the loop when there's no reason to, and the centroidValue^2 calculation
> in each iteration is gone.
>
> @Override
>  public double getDistanceSquared(Vector v) {
>    //TODO: Check sizes?
>
>    double result = 0.0;
>    Iterator<Vector.Element> iter = iterateNonZero();
>    Vector.Element elt;
>    double delta;
>
>    while (iter.hasNext()) {
>      elt = iter.next();
>      delta = elt.get() - v.getQuick(elt.index());
>      result += (delta * delta); //- (centroidValue * centroidValue);
>    }
>    return result;
>  }
>
> Cluster.java
> Refactored emitPointToNearestCluster(). The null comparison is gone
> (no longer necessary, since nearestDistance starts at
> Double.MAX_VALUE), and sparseDistance() is called now instead.
>
> ...
>   Cluster nearestCluster = null;
>    double nearestDistance = Double.MAX_VALUE;
>    double distance = 0;
>    for (Cluster cluster : clusters) {
>      distance = measure.sparseDistance(cluster.getCenter(), point);
>      if (distance < nearestDistance) {
>        nearestCluster = cluster;
>        nearestDistance = distance;
>      }
>    }
> ...
>
> Added a Math.sqrt call to computeConvergence(), which now uses
> sparseDistance().
>
>  public boolean computeConvergence() {
>    Vector centroid = computeCentroid();
>    converged = Math.sqrt(measure.sparseDistance(/*centroid.getLengthSquared(),*/ centroid, center))
>        <= convergenceDelta;
>    return converged;
>  }
>
>
> After all those modifications, kMeans now runs noticeably faster.
> Running the whole thing locally (in a pseudo-distributed cluster),
> every iteration takes me about 7 minutes to complete on the very same
> dataset I uploaded and posted some days ago, down from the 3 hours it
> used to take. Again, this is not meant to be a final, closed solution
> at all, but I think it could very well serve as a first step towards
> one.

