mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: kMeans Help
Date Mon, 29 Jun 2009 17:34:17 GMT
FYI, if I make this change, the only test that fails is
TestKmeansClustering#testReferenceImplementation.

See MAHOUT-141
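
A minimal sketch of the change under discussion, assuming a simplified stand-in class: SketchCluster is hypothetical, uses plain double[] arrays rather than Mahout's Vector, and leaves out the std calculation. It only illustrates returning the center when a cluster has no points:

```java
/** Hypothetical, simplified stand-in for Cluster; this is not Mahout code. */
final class SketchCluster {
  private final double[] center;     // prior expectation of the centroid
  private final double[] pointTotal; // running sum of added points, starts at zero
  private int numPoints;

  SketchCluster(double[] center) {
    this.center = center.clone();
    // Same cardinality as center, all values zero (as in the rehydration case).
    this.pointTotal = new double[center.length];
  }

  void addPoint(double[] point) {
    if (point.length != pointTotal.length) {
      throw new IllegalArgumentException("cardinality mismatch");
    }
    for (int i = 0; i < point.length; i++) {
      pointTotal[i] += point[i];
    }
    numPoints++;
  }

  double[] computeCentroid() {
    if (numPoints == 0) {
      // With no points, report the center instead of the all-zero pointTotal.
      return center.clone();
    }
    double[] centroid = new double[pointTotal.length];
    for (int i = 0; i < centroid.length; i++) {
      centroid[i] = pointTotal[i] / numPoints;
    }
    return centroid;
  }
}
```

With this shape, a freshly constructed (or rehydrated) SketchCluster reports its center as the centroid, and the workaround of adding the center as a point is no longer needed.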

On Jun 29, 2009, at 12:07 PM, Jeff Eastman wrote:

> I have no problem with returning center as the centroid for a  
> cluster with no points. From Ted's earlier discussion, the center is  
> the prior expectation of the centroid and returning a zero vector is  
> just a bug that has not made itself apparent until now.
>
> I also agree that serializing and then deserializing a cluster (or  
> any object for that matter) should not alter its state.
>
>
> Grant Ingersoll wrote:
>>
>> On Jun 28, 2009, at 5:55 PM, Grant Ingersoll wrote:
>>
>>>
>>> On Jun 28, 2009, at 4:56 PM, Grant Ingersoll wrote:
>>>
>>>> I get all of this. My point is that when you rehydrate the
>>>> Cluster, it doesn't properly report the centroid (per my email),
>>>> all because numPoints == 0 and pointTotal is a vector of the
>>>> same cardinality as the passed-in center vector, but initialized to 0.
>>>>
>>>
>>> In other words, the simple act of serializing a Cluster to HDFS  
>>> and then reconstituting it should not alter the result one gets,  
>>> which I believe is what happens if one dumps out the clusters that  
>>> have been calculated after the whole process is done.
>>
>> [1] is what I had to do to work around it for the Random approach,  
>> but I think it isn't the right approach.
>>
>> I think the problem lies in computeCentroid:
>> private Vector computeCentroid() {
>>    if (numPoints == 0)
>>      return pointTotal;
>>    else if (centroid == null) {
>>      // lazy compute new centroid
>>      centroid = pointTotal.divide(numPoints);
>>      Vector stds = pointSquaredTotal.times(numPoints).minus(
>>          pointTotal.times(pointTotal)).assign(new SquareRootFunction())
>>          .divide(numPoints);
>>      std = stds.zSum() / 2;
>>    }
>>    return centroid;
>>  }
>>
>> I don't understand why, if numPoints == 0, the next line isn't just
>> "return center;". Why wouldn't the center and the centroid be the
>> same if there are no points? In the rehydration case (or when just
>> calling new Cluster(center)), pointTotal is just a vector of the
>> same cardinality as center, with all values zero.
>>
>>
>>
>> [1]:
>> Author: gsingers
>> Date: Sat Jun 27 02:57:18 2009
>> New Revision: 788919
>>
>> URL: http://svn.apache.org/viewvc?rev=788919&view=rev
>> Log:
>> add the center as a point
>>
>> Modified:
>>   lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java
>>
>> Modified: lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java
>> URL: http://svn.apache.org/viewvc/lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java?rev=788919&r1=788918&r2=788919&view=diff
>> ==============================================================================
>> --- lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java (original)
>> +++ lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java Sat Jun 27 02:57:18 2009
>> @@ -54,7 +54,9 @@
>>        if (log.isInfoEnabled()) {
>>          log.info("Selected: " + value.asFormatString());
>>        }
>> -        writer.append(new Text(key.toString()), new Cluster(value));
>> +        Cluster val = new Cluster(value);
>> +        val.addPoint(value);
>> +        writer.append(new Text(key.toString()), val);
>>        count++;
>>      }
>>    }
>>
>>
>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search

