Message-ID: <4A492857.6080500@windwardsolutions.com>
Date: Mon, 29 Jun 2009 13:47:19 -0700
From: Jeff Eastman
To: mahout-user@lucene.apache.org
Reply-To: mahout-user@lucene.apache.org
Subject: Re: kMeans Help
References: <9D2894FE-5DBC-45C7-B8EC-AF3CD01E91CB@apache.org>
 <4A453BF5.70401@windwardsolutions.com>
 <4A455DB5.1040500@windwardsolutions.com>
 <7FC9E5D4-3AF4-44A1-A819-18C9ED627F0C@apache.org>
 <28304FA2-6615-4D2F-94D3-1EBFC86B8229@apache.org>
 <3E9D6FD3-CC7E-4C8D-9732-1E2AAB0300EE@apache.org>
 <4A79CB0A-6EE7-4640-A3C3-22F0B0B41746@apache.org>
 <37ffc8080906270718j725fd486qe2635f36d7158c87@mail.gmail.com>
 <4A463DCB.2040701@windwardsolutions.com>
 <627F3A4C-DF85-4866-8340-CCB2F74A66DA@apache.org>
 <84F2E80F-9D26-4AB6-941E-BEA3C5B24DF6@apache.org>
 <71D98AC1-1156-4B07-83BC-492997EA67C6@apache.org>
 <4A48E6DF.6040202@windwardsolutions.com>

Changing the centroid of an empty cluster to return its center fixes a
bug in the convergence calculation and causes convergence to happen
earlier. By returning a zero centroid vector instead of the center, the
convergence test had marked empty clusters as not converged. This
changes the outcome of the clustering. I changed the
expectedNumPoints[2] to be {4,4,1} and the test passes.

Grant Ingersoll wrote:
> FYI, if I make this change the only test that fails is
> TestKmeansClustering#testReferenceImplementation.
>
> See MAHOUT-141
>
> On Jun 29, 2009, at 12:07 PM, Jeff Eastman wrote:
>
>> I have no problem with returning center as the centroid for a cluster
>> with no points. From Ted's earlier discussion, the center is the
>> prior expectation of the centroid and returning a zero vector is just
>> a bug that has not made itself apparent until now.
>>
>> I also agree that serializing and then deserializing a cluster (or
>> any object for that matter) should not alter its state.
>>
>> Grant Ingersoll wrote:
>>>
>>> On Jun 28, 2009, at 5:55 PM, Grant Ingersoll wrote:
>>>
>>>> On Jun 28, 2009, at 4:56 PM, Grant Ingersoll wrote:
>>>>
>>>>> I get all of this, my point is that when you rehydrate the
>>>>> Cluster, it doesn't properly report the centroid per my email all
>>>>> because numPoints == 0 and pointTotal is a vector that is the
>>>>> same as the passed-in center vector, but initialized to 0.
>>>>
>>>> In other words, the simple act of serializing a Cluster to HDFS and
>>>> then reconstituting it should not alter the result one gets, which
>>>> I believe is what happens if one dumps out the clusters that have
>>>> been calculated after the whole process is done.
>>>
>>> [1] is what I had to do to work around it for the Random approach,
>>> but I think it isn't the right approach.
>>>
>>> I think the problem lies in computeCentroid:
>>>   private Vector computeCentroid() {
>>>     if (numPoints == 0)
>>>       return pointTotal;
>>>     else if (centroid == null) {
>>>       // lazy compute new centroid
>>>       centroid = pointTotal.divide(numPoints);
>>>       Vector stds = pointSquaredTotal.times(numPoints).minus(
>>>           pointTotal.times(pointTotal)).assign(new SquareRootFunction())
>>>           .divide(numPoints);
>>>       std = stds.zSum() / 2;
>>>     }
>>>     return centroid;
>>>   }
>>>
>>> I don't understand why, if numPoints == 0, the next line isn't just:
>>> return center; Why wouldn't the center and the centroid be the same
>>> if there are no points? pointTotal in the rehydration case (or in
>>> the case of just calling new Cluster(center)) is just a vector of the
>>> same cardinality as center but all values are zero.
>>>
>>> [1]:
>>> Author: gsingers
>>> Date: Sat Jun 27 02:57:18 2009
>>> New Revision: 788919
>>>
>>> URL: http://svn.apache.org/viewvc?rev=788919&view=rev
>>> Log:
>>> add the center as a point
>>>
>>> Modified:
>>>     lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java
>>>
>>> Modified: lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java
>>> URL: http://svn.apache.org/viewvc/lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java?rev=788919&r1=788918&r2=788919&view=diff
>>> ==============================================================================
>>> --- lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java (original)
>>> +++ lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java Sat Jun 27 02:57:18 2009
>>> @@ -54,7 +54,9 @@
>>>          if (log.isInfoEnabled()) {
>>>            log.info("Selected: " + value.asFormatString());
>>>          }
>>> -        writer.append(new Text(key.toString()), new Cluster(value));
>>> +        Cluster val = new Cluster(value);
>>> +        val.addPoint(value);
>>> +        writer.append(new Text(key.toString()), val);
>>>          count++;
>>>        }
>>>      }
>>>
>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
> using Solr/Lucene:
> http://www.lucidimagination.com/search
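
For illustration, the change discussed in this thread (returning the
center as the centroid when a cluster has no points) can be sketched
roughly as below. This is only a minimal sketch under stated
assumptions, not the Mahout code: ClusterSketch is a hypothetical
class, plain double[] arrays stand in for Mahout's Vector, and the std
calculation from computeCentroid is left out.

```java
import java.util.Arrays;

// Hypothetical stand-in for the Cluster class discussed in the thread.
// Assumption: plain double[] arrays replace Mahout's Vector type.
class ClusterSketch {
    private final double[] center;     // prior expectation of the centroid
    private final double[] pointTotal; // running sum of added points, starts at zero
    private int numPoints;
    private double[] centroid;         // lazily computed, cleared when a point is added

    ClusterSketch(double[] center) {
        this.center = center.clone();
        this.pointTotal = new double[center.length];
    }

    void addPoint(double[] point) {
        if (point.length != center.length) {
            throw new IllegalArgumentException("cardinality mismatch");
        }
        for (int i = 0; i < point.length; i++) {
            pointTotal[i] += point[i];
        }
        numPoints++;
        centroid = null; // force recomputation on the next call
    }

    double[] computeCentroid() {
        if (numPoints == 0) {
            // The fix under discussion: an empty cluster reports its center,
            // not the all-zero pointTotal.
            return center.clone();
        }
        if (centroid == null) {
            // lazy compute new centroid as the mean of the added points
            centroid = new double[pointTotal.length];
            for (int i = 0; i < pointTotal.length; i++) {
                centroid[i] = pointTotal[i] / numPoints;
            }
        }
        return centroid.clone();
    }

    public static void main(String[] args) {
        ClusterSketch empty = new ClusterSketch(new double[] {1.0, 2.0});
        System.out.println(Arrays.toString(empty.computeCentroid()));
    }
}
```

With this shape, a freshly constructed (or rehydrated) cluster with
numPoints == 0 reports its center rather than a zero vector, so the
addPoint(value) workaround in [1] would no longer be needed for that
purpose.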