mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: kMeans Help
Date Mon, 29 Jun 2009 21:02:09 GMT
OK, please commit.   Thx!


On Jun 29, 2009, at 4:47 PM, Jeff Eastman wrote:

> Changing the centroid of an empty cluster to return its center fixes
> a bug in the convergence calculation and causes convergence to
> happen earlier. Because a zero centroid vector was returned instead
> of the center, the convergence test marked empty clusters as not
> converged. This changes the outcome of the clustering. I changed
> expectedNumPoints[2] to {4,4,1} and the test passes.
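
A minimal sketch of the effect described above, under stated assumptions: the convergence test is taken to compare the distance between a cluster's center and its centroid against a delta, plain double[] stands in for Mahout's Vector, and Euclidean distance stands in for the configured distance measure. ConvergenceSketch and its methods are hypothetical names, not Mahout API.

```java
// Hypothetical sketch: plain double[] stands in for Mahout's Vector, and
// Euclidean distance stands in for whatever distance measure is configured.
final class ConvergenceSketch {
  static double distance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  // Assumed form of the test: converged when center and centroid differ by at most delta.
  static boolean converged(double[] center, double[] centroid, double delta) {
    return distance(center, centroid) <= delta;
  }

  public static void main(String[] args) {
    double[] center = {3.0, 4.0};
    double delta = 0.5;
    // Empty cluster reporting a zero centroid: distance is 5.0, so not converged.
    System.out.println(converged(center, new double[] {0.0, 0.0}, delta)); // false
    // Empty cluster reporting its center: distance is 0.0, so converged.
    System.out.println(converged(center, center.clone(), delta)); // true
  }
}
```

Under these assumptions, an empty cluster that reports a zero centroid can only pass the test if its center already lies within delta of the origin, which would explain why convergence happens earlier once the center is returned.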
>
>
>
> Grant Ingersoll wrote:
>> FYI, if I make this change the only test that fails is  
>> TestKmeansClustering#testReferenceImplementation.
>>
>> See MAHOUT-141
>>
>> On Jun 29, 2009, at 12:07 PM, Jeff Eastman wrote:
>>
>>> I have no problem with returning center as the centroid for a  
>>> cluster with no points. From Ted's earlier discussion, the center  
>>> is the prior expectation of the centroid and returning a zero  
>>> vector is just a bug that has not made itself apparent until now.
>>>
>>> I also agree that serializing and then deserializing a cluster (or  
>>> any object for that matter) should not alter its state.
>>>
>>>
>>> Grant Ingersoll wrote:
>>>>
>>>> On Jun 28, 2009, at 5:55 PM, Grant Ingersoll wrote:
>>>>
>>>>>
>>>>> On Jun 28, 2009, at 4:56 PM, Grant Ingersoll wrote:
>>>>>
>>>>>> I get all of this. My point is that when you rehydrate the
>>>>>> Cluster, it doesn't properly report the centroid (per my earlier
>>>>>> email), because numPoints == 0 and pointTotal is a vector of the
>>>>>> same cardinality as the passed-in center vector, but initialized
>>>>>> to 0.
>>>>>>
>>>>>
>>>>> In other words, the simple act of serializing a Cluster to HDFS  
>>>>> and then reconstituting it should not alter the result one gets,  
>>>>> which I believe is what happens if one dumps out the clusters  
>>>>> that have been calculated after the whole process is done.
>>>>
>>>> [1] is what I had to do to work around it for the Random  
>>>> approach, but I think it isn't the right approach.
>>>>
>>>> I think the problem lies in computeCentroid:
>>>> private Vector computeCentroid() {
>>>>   if (numPoints == 0)
>>>>     return pointTotal;
>>>>   else if (centroid == null) {
>>>>     // lazy compute new centroid
>>>>     centroid = pointTotal.divide(numPoints);
>>>>     Vector stds = pointSquaredTotal.times(numPoints)
>>>>         .minus(pointTotal.times(pointTotal))
>>>>         .assign(new SquareRootFunction())
>>>>         .divide(numPoints);
>>>>     std = stds.zSum() / 2;
>>>>   }
>>>>   return centroid;
>>>> }
>>>>
>>>> I don't understand why, if numPoints == 0, the next line isn't
>>>> just: return center;  Why wouldn't the center and the centroid be
>>>> the same if there are no points?  pointTotal in the rehydration
>>>> case (or in the case of just calling new Cluster(center)) is just
>>>> a vector of the same cardinality as center but with all values zero.
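
A minimal sketch of the two behaviors being compared, assuming a stripped-down stand-in for Cluster that tracks only center, pointTotal, and numPoints. EmptyClusterSketch and its method names are hypothetical, and plain double[] replaces Mahout's Vector so the example is self-contained.

```java
import java.util.Arrays;

// Hypothetical stand-in for the Cluster class discussed above.
final class EmptyClusterSketch {
  private final double[] center;
  private final double[] pointTotal; // running sum of added points, starts at zero
  private int numPoints;

  EmptyClusterSketch(double[] center) {
    this.center = center.clone();
    this.pointTotal = new double[center.length];
  }

  void addPoint(double[] point) {
    for (int i = 0; i < point.length; i++) {
      pointTotal[i] += point[i];
    }
    numPoints++;
  }

  // Behavior of the quoted code: an empty cluster reports the all-zero pointTotal.
  double[] centroidReturningPointTotal() {
    return numPoints == 0 ? pointTotal.clone() : mean();
  }

  // Proposed behavior: an empty cluster reports its center.
  double[] centroidReturningCenter() {
    return numPoints == 0 ? center.clone() : mean();
  }

  private double[] mean() {
    double[] m = new double[pointTotal.length];
    for (int i = 0; i < m.length; i++) {
      m[i] = pointTotal[i] / numPoints;
    }
    return m;
  }

  public static void main(String[] args) {
    // A freshly constructed (or rehydrated) cluster with no points.
    EmptyClusterSketch c = new EmptyClusterSketch(new double[] {1.0, 2.0});
    System.out.println(Arrays.toString(c.centroidReturningPointTotal())); // [0.0, 0.0]
    System.out.println(Arrays.toString(c.centroidReturningCenter()));     // [1.0, 2.0]
  }
}
```

In this sketch, calling addPoint with the center itself also makes both variants report the center, which mirrors the workaround in [1] at the cost of counting a point that was never assigned to the cluster.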
>>>>
>>>>
>>>>
>>>> [1]:
>>>> Author: gsingers
>>>> Date: Sat Jun 27 02:57:18 2009
>>>> New Revision: 788919
>>>>
>>>> URL: http://svn.apache.org/viewvc?rev=788919&view=rev
>>>> Log:
>>>> add the center as a point
>>>>
>>>> Modified:
>>>>  lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java
>>>>
>>>> Modified: lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java
>>>> URL: http://svn.apache.org/viewvc/lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java?rev=788919&r1=788918&r2=788919&view=diff
>>>> ===================================================================
>>>> --- lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java (original)
>>>> +++ lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java Sat Jun 27 02:57:18 2009
>>>> @@ -54,7 +54,9 @@
>>>>       if (log.isInfoEnabled()) {
>>>>         log.info("Selected: " + value.asFormatString());
>>>>       }
>>>> -        writer.append(new Text(key.toString()), new Cluster(value));
>>>> +        Cluster val = new Cluster(value);
>>>> +        val.addPoint(value);
>>>> +        writer.append(new Text(key.toString()), val);
>>>>       count++;
>>>>     }
>>>>   }
>>>>
>>>>
>>>
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
>> using Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>>
>>
>
