From: Grant Ingersoll
To: mahout-user@lucene.apache.org
Subject: Re: kMeans Help
Date: Mon, 29 Jun 2009 17:02:09 -0400
Message-Id: <7124EC9B-EF76-47F9-8751-49E6FB8AEFA6@apache.org>
In-Reply-To: <4A492857.6080500@windwardsolutions.com>

OK, please commit. Thx!

On Jun 29, 2009, at 4:47 PM, Jeff Eastman wrote:

> Changing the centroid of an empty cluster to return its center fixes
> a bug in the convergence calculation and causes convergence to
> happen earlier. By returning a zero centroid vector instead of the
> center, the convergence test had marked empty clusters as not
> converged. This changes the outcome of the clustering. I changed the
> expectedNumPoints[2] to be {4,4,1} and the test passes.
>
> Grant Ingersoll wrote:
>> FYI, if I make this change the only test that fails is
>> TestKmeansClustering#testReferenceImplementation.
>>
>> See MAHOUT-141
>>
>> On Jun 29, 2009, at 12:07 PM, Jeff Eastman wrote:
>>
>>> I have no problem with returning center as the centroid for a
>>> cluster with no points. From Ted's earlier discussion, the center
>>> is the prior expectation of the centroid and returning a zero
>>> vector is just a bug that has not made itself apparent until now.
>>>
>>> I also agree that serializing and then deserializing a cluster (or
>>> any object for that matter) should not alter its state.
>>>
>>> Grant Ingersoll wrote:
>>>>
>>>> On Jun 28, 2009, at 5:55 PM, Grant Ingersoll wrote:
>>>>
>>>>> On Jun 28, 2009, at 4:56 PM, Grant Ingersoll wrote:
>>>>>
>>>>>> I get all of this, my point is that when you rehydrate the
>>>>>> Cluster, it doesn't properly report the centroid per my email,
>>>>>> all because numPoints == 0 and pointTotal is a vector of the
>>>>>> same cardinality as the passed-in center vector, but
>>>>>> initialized to 0.
>>>>>
>>>>> In other words, the simple act of serializing a Cluster to HDFS
>>>>> and then reconstituting it should not alter the result one gets,
>>>>> which I believe is what happens if one dumps out the clusters
>>>>> that have been calculated after the whole process is done.
>>>>
>>>> [1] is what I had to do to work around it for the Random
>>>> approach, but I think it isn't the right approach.
>>>>
>>>> I think the problem lies in computeCentroid:
>>>>
>>>>   private Vector computeCentroid() {
>>>>     if (numPoints == 0)
>>>>       return pointTotal;
>>>>     else if (centroid == null) {
>>>>       // lazy compute new centroid
>>>>       centroid = pointTotal.divide(numPoints);
>>>>       Vector stds = pointSquaredTotal.times(numPoints).minus(
>>>>           pointTotal.times(pointTotal)).assign(new SquareRootFunction())
>>>>           .divide(numPoints);
>>>>       std = stds.zSum() / 2;
>>>>     }
>>>>     return centroid;
>>>>   }
>>>>
>>>> I don't understand why, if numPoints == 0, the next line isn't
>>>> just: return center; Why wouldn't the center and the centroid be
>>>> the same if there are no points? pointTotal in the rehydration
>>>> case (or in the case of just calling new Cluster(center)) is just
>>>> a vector of the same cardinality as center, but all values are zero.
>>>>
>>>> [1]:
>>>> Author: gsingers
>>>> Date: Sat Jun 27 02:57:18 2009
>>>> New Revision: 788919
>>>>
>>>> URL: http://svn.apache.org/viewvc?rev=788919&view=rev
>>>> Log:
>>>> add the center as a point
>>>>
>>>> Modified:
>>>>     lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java
>>>>
>>>> Modified: lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java
>>>> URL: http://svn.apache.org/viewvc/lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java?rev=788919&r1=788918&r2=788919&view=diff
>>>> ==============================================================================
>>>> --- lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java (original)
>>>> +++ lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java Sat Jun 27 02:57:18 2009
>>>> @@ -54,7 +54,9 @@
>>>>          if (log.isInfoEnabled()) {
>>>>            log.info("Selected: " + value.asFormatString());
>>>>          }
>>>> -        writer.append(new Text(key.toString()), new Cluster(value));
>>>> +        Cluster val = new Cluster(value);
>>>> +        val.addPoint(value);
>>>> +        writer.append(new Text(key.toString()), val);
>>>>          count++;
>>>>        }
>>>>      }

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search
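
For illustration, here is a minimal Java sketch of the empty-cluster behaviour discussed in this thread. It is not the Mahout code: SimpleCluster and all of its method names are hypothetical stand-ins that use double[] in place of Mahout's Vector, and the convergence check is a simplified assumption (Euclidean distance between the center and the reported centroid, compared against a delta). It only aims to show why a freshly constructed or rehydrated cluster with numPoints == 0 looks unconverged when the zero-valued pointTotal is reported as its centroid, and why reporting the center avoids that.

```java
import java.util.Arrays;

// Hypothetical stand-in for a k-means cluster; not Mahout's Cluster class.
class SimpleCluster {
    private final double[] center;      // prior expectation of the centroid
    private final double[] pointTotal;  // running sum of added points, starts at zero
    private int numPoints;

    SimpleCluster(double[] center) {
        this.center = center.clone();
        this.pointTotal = new double[center.length]; // same cardinality, all zeros
    }

    void addPoint(double[] point) {
        for (int i = 0; i < point.length; i++) {
            pointTotal[i] += point[i];
        }
        numPoints++;
    }

    // Behaviour questioned in the thread: an empty cluster reports the zero vector.
    double[] centroidReturningPointTotal() {
        return numPoints == 0 ? pointTotal.clone() : mean();
    }

    // Behaviour proposed in the thread: an empty cluster reports its center.
    double[] centroidReturningCenter() {
        return numPoints == 0 ? center.clone() : mean();
    }

    private double[] mean() {
        double[] result = new double[pointTotal.length];
        for (int i = 0; i < result.length; i++) {
            result[i] = pointTotal[i] / numPoints;
        }
        return result;
    }

    // Assumed convergence rule: the centroid lies within delta of the center.
    boolean converged(double[] centroid, double delta) {
        double sumOfSquares = 0.0;
        for (int i = 0; i < center.length; i++) {
            double diff = center[i] - centroid[i];
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares) <= delta;
    }
}

class EmptyClusterSketch {
    public static void main(String[] args) {
        // Mimics "new Cluster(center)" or a cluster read back with no points.
        SimpleCluster rehydrated = new SimpleCluster(new double[] {3.0, 4.0});

        double[] zeroCentroid = rehydrated.centroidReturningPointTotal();
        double[] centerCentroid = rehydrated.centroidReturningCenter();

        System.out.println(Arrays.toString(zeroCentroid));   // [0.0, 0.0]
        System.out.println(Arrays.toString(centerCentroid)); // [3.0, 4.0]

        // The zero vector is 5.0 away from the center, so the empty cluster
        // is reported as not converged; the center is 0.0 away, so it is.
        System.out.println(rehydrated.converged(zeroCentroid, 0.001));   // false
        System.out.println(rehydrated.converged(centerCentroid, 0.001)); // true
    }
}
```

With the second variant, constructing a cluster from a center and asking for its centroid gives back the same vector, so the add-the-center-as-a-point workaround from [1] would not be needed in this sketch.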