From mahout-user-return-1135-apmail-lucene-mahout-user-archive=lucene.apache.org@lucene.apache.org Tue Jul 28 14:15:28 2009 Return-Path: Delivered-To: apmail-lucene-mahout-user-archive@minotaur.apache.org Received: (qmail 96239 invoked from network); 28 Jul 2009 14:15:28 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3) by minotaur.apache.org with SMTP; 28 Jul 2009 14:15:28 -0000 Received: (qmail 62136 invoked by uid 500); 28 Jul 2009 14:16:45 -0000 Delivered-To: apmail-lucene-mahout-user-archive@lucene.apache.org Received: (qmail 62076 invoked by uid 500); 28 Jul 2009 14:16:45 -0000 Mailing-List: contact mahout-user-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: mahout-user@lucene.apache.org Delivered-To: mailing list mahout-user@lucene.apache.org Received: (qmail 62066 invoked by uid 99); 28 Jul 2009 14:16:45 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 28 Jul 2009 14:16:45 +0000 X-ASF-Spam-Status: No, hits=-0.0 required=10.0 tests=SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of nfantone@gmail.com designates 209.85.221.203 as permitted sender) Received: from [209.85.221.203] (HELO mail-qy0-f203.google.com) (209.85.221.203) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 28 Jul 2009 14:16:35 +0000 Received: by qyk41 with SMTP id 41so63017qyk.29 for ; Tue, 28 Jul 2009 07:16:14 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:date:message-id:subject :from:to:content-type:content-transfer-encoding; bh=cX4iahJiiFIOPu5kCpjvgvbs9yV4+8pMchw510xxHfM=; b=YYQVm0ydcAEbZcmqZUR78py25P6DhPSXLFCG2E2S00dg1zA5GMKQg2WHX3l7m/DS7R dsIJ1abzaHENgXiAu0RXol4NVss4NC0A2PQ9U08PaAhg+TTCcfe0OaBVaHIk5PSD70ft Z87BeHGxgRnZfrxYk5tHcTu7cTPCQJnif+zB4= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:date:message-id:subject:from:to:content-type :content-transfer-encoding; b=n/pX90Ne1lsnIhvIU2XZc295SM9JUR3UDfHnr8M09N1pTgkb87kjuexcljCKJ8/pNW YB7cKaNmu8SNyIr0NVDg34webCQZX84wge9eeK820w0DyDOjeJ9FUnWoIAqSQ4aPa8C+ m34JYnVgUt9eRfnArLc2s2yUk4dWy5V98BSZM= MIME-Version: 1.0 Received: by 10.150.225.14 with SMTP id x14mr13658170ybg.8.1248790573830; Tue, 28 Jul 2009 07:16:13 -0700 (PDT) Date: Tue, 28 Jul 2009 11:16:13 -0300 Message-ID: <37ffc8080907280716occ7251w67564d40da610dfe@mail.gmail.com> Subject: Distance calculation performance issue From: nfantone To: mahout-user@lucene.apache.org Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Virus-Checked: Checked by ClamAV on apache.org Ok, this post is going to be a long one, and so it deserves its own thread. My apologies beforehand. Here's what I have tried to ease the distance calculation problem. I know it's quite nasty, but I wanted to ensure our bottleneck was there, in the euclidean distance computation. But first, a little excursus: /* EXCURSUS */ The "optimized" version for SparseVectors of distance() in SquaredEuclideanDistanceMeasure.java, currently, does the following: if (centroid.size() != v.size()) { throw new CardinalityException(); } double result = centroidLengthSquare; result += v.getDistanceSquared(centroid); return centroidLengthSquare + v.getDistanceSquared(centroid); It expects a vector squared length, which is finally added to what getDistanceSquared() returns. However, that method computes this inside a while loop (for the comments, assume two 2D vectors [X0, X1] and [Y0, Y1]): elt = iter.next(); centroidValue = v.getQuick(elt.index()); delta = elt.get() - centroidValue; result += (delta * delta) - (centroidValue * centroidValue); // This is (X0 - Y0)*(X0 - Y0) - Y0*Y0 + (X1 - Y1)*(X1 - Y1) - Y1*Y1; I don't have the slightest idea what to call this value. I certainly couldn't understand the reason behind that (centroidValue * centroidValue) subtraction. Being called getDistanceSquared, the method should return just that... and that is the value one gets by just ignoring that substraction, that is: ... result += (delta * delta); //This is (X0 - Y0)*(X0 - Y0) + (X1 - Y1)*(X1 - Y1); the ACTUAL squared distance. ... Moreover, the sum of every (centroidValue * centroidValue) in each iteration (Y0*Y0 + Y1*Y1, in the examples) corresponds to centroidLengthSquare, which was previously calculated before calling distance(centroidLengthSquare, centroid, v) and then added to the result. So, to sum up: first, a certain length is calculated (the squared norm of what the signature from distance() calls "centroid"); second, that SAME value gets calculated again in an another method and subtracted from a certain total; last, those values cancel each other by summing both totals, leaving just the wanted squared distance, here: return centroidLengthSquare + v.getDistanceSquared(centroid); Which could be re-written as: return centroidLengthSquare + squaredDistanceBetweenCentroidAndV - centroidLengthSquare; Maybe this has been done on purpose, though that purpose eludes me for now. /* END EXCURSUS */ And now, the fun part: code disrupting. Here are the changes I applied (remember: just for testing performance's sake). I left the changed bits commented. EuclideanDistanceMeasure.java Renamed distance(double centroidLengthSquare, Vector centroid, Vector v) to sparseDistance and eliminate the first param. We don't need the sqrt to compare real distances for emitting points to clusters (but we do need them to compute a cluster's convergence), so I took that out, as well (I know the whole point of this class is to sqrt the results from SquaredEuclideanDistance, but I needed the distance function that's compromising performance to do a minimal set of calculations) . @Override public double sparseDistance(/*double centroidLengthSquare,*/ Vector centroid, Vector v) { return /*Math.sqrt(*/super.sparseDistance(/*centroidLengthSquare,*/ centroid, v);//*); } SquaredEuclideanDistanceMeasure.java Rewrote and renamed the implementation of distance(double centroidLengthSquare, Vector centroid, Vector v). Ignored size check (again: just for the benefit of performance). Just return the result of getDistanceSquared(). @Override public double sparseDistance(/*double centroidLengthSquare,*/ Vector centroid, Vector v) { /* if (centroid.size() != v.size()) { throw new CardinalityException(); } double result = centroidLengthSquare; result += v.getDistanceSquared(centroid); */ return /*centroidLengthSquare +*/ v.getDistanceSquared(centroid); } SparseVector.java Refactor of getDistanceSquared(). Variables should not be created in a loop, if there's no need to do so. Eliminate the centroidValue^2 calculation in each iteration. @Override public double getDistanceSquared(Vector v) { //TODO: Check sizes? double result = 0.0; Iterator iter = iterateNonZero(); Vector.Element elt; double delta; while (iter.hasNext()) { elt = iter.next(); delta = elt.get() - v.getQuick(elt.index()); result += (delta * delta); //- (centroidValue * centroidValue); } return result; } Cluster.java Refactor of emitPointToNearestCluster(). null comparison eliminated (not necessary anymore). sparseDistance() is called now, instead. ... Cluster nearestCluster = null; double nearestDistance = Double.MAX_VALUE; double distance = 0; for (Cluster cluster : clusters) { distance = measure.sparseDistance(cluster.getCenter(), point); if (distance < nearestDistance) { nearestCluster = cluster; nearestDistance = distance; } } ... Add a Math.sqrt call to computeConvergence(), which now uses sparseDistance(). public boolean computeConvergence() { Vector centroid = computeCentroid(); converged = Math.sqrt(measure.sparseDistance(/*centroid.getLengthSquared(),*/ centroid, center)) <= convergenceDelta; return converged; } After all those modifications, kMeans now runs noticeably faster. Running the whole thing locally -in a pseudo-distributed cluster-, every iteration is taking me about ~7 minutes to complete using the very same dataset I uploaded and posted some days ago, which used to last 3 hours. Again, this is not meant to be a final, closed solution at all. But, I think it could very well serve as a first step towards that.