mahout-user mailing list archives

From Jeff Eastman <j...@windwardsolutions.com>
Subject Re: kMeans Help
Date Mon, 29 Jun 2009 16:07:59 GMT
I have no problem with returning center as the centroid for a cluster 
with no points. From Ted's earlier discussion, the center is the prior 
expectation of the centroid and returning a zero vector is just a bug 
that has not made itself apparent until now.

I also agree that serializing and then deserializing a cluster (or any 
object for that matter) should not alter its state.
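
A minimal sketch of both points, not the actual Mahout code: SimpleCluster is a hypothetical stand-in for Cluster that uses double[] in place of Vector, and roundTrip() assumes that only the centroid is persisted and then passed back to the constructor as the new center.

```java
import java.util.Arrays;

public class EmptyClusterSketch {

    /** Hypothetical stand-in for Mahout's Cluster; double[] replaces Vector. */
    static final class SimpleCluster {
        private final double[] center;     // prior expectation of the centroid
        private final double[] pointTotal; // running sum of added points, starts at zero
        private int numPoints;

        SimpleCluster(double[] center) {
            this.center = center.clone();
            this.pointTotal = new double[center.length];
        }

        void addPoint(double[] point) {
            for (int i = 0; i < pointTotal.length; i++) {
                pointTotal[i] += point[i];
            }
            numPoints++;
        }

        double[] computeCentroid() {
            if (numPoints == 0) {
                // No observations: fall back to the center, not the zeroed pointTotal.
                return center.clone();
            }
            double[] centroid = new double[pointTotal.length];
            for (int i = 0; i < centroid.length; i++) {
                centroid[i] = pointTotal[i] / numPoints;
            }
            return centroid;
        }

        /** Assumed round trip: the centroid is written out and read back as the center. */
        SimpleCluster roundTrip() {
            return new SimpleCluster(computeCentroid());
        }
    }

    public static void main(String[] args) {
        SimpleCluster empty = new SimpleCluster(new double[] {1.0, 2.0});
        // Both lines print the original center rather than a zero vector.
        System.out.println(Arrays.toString(empty.computeCentroid()));
        System.out.println(Arrays.toString(empty.roundTrip().computeCentroid()));
    }
}
```

With the zero-point branch returning pointTotal instead, the same round trip would turn an empty cluster's centroid into a zero vector, which is the behavior Grant describes below.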


Grant Ingersoll wrote:
>
> On Jun 28, 2009, at 5:55 PM, Grant Ingersoll wrote:
>
>>
>> On Jun 28, 2009, at 4:56 PM, Grant Ingersoll wrote:
>>
>>> I get all of this. My point is that when you rehydrate the Cluster, 
>>> it doesn't properly report the centroid (per my email), because 
>>> numPoints == 0 and pointTotal is a vector with the same cardinality 
>>> as the passed-in center vector, but initialized to 0.
>>>
>>
>> In other words, the simple act of serializing a Cluster to HDFS and 
>> then reconstituting it should not alter the result one gets, which I 
>> believe is what happens if one dumps out the clusters that have been 
>> calculated after the whole process is done.
>
> [1] is what I had to do to work around it for the Random approach, but 
> I think it isn't the right approach.
>
> I think the problem lies in computeCentroid:
> private Vector computeCentroid() {
>     if (numPoints == 0)
>       return pointTotal;
>     else if (centroid == null) {
>       // lazy compute new centroid
>       centroid = pointTotal.divide(numPoints);
>       Vector stds = pointSquaredTotal.times(numPoints).minus(
>           pointTotal.times(pointTotal)).assign(new SquareRootFunction())
>           .divide(numPoints);
>       std = stds.zSum() / 2;
>     }
>     return centroid;
>   }
>
> I don't understand why, if numPoints == 0, the next line isn't just: 
> return center;  Why wouldn't the center and the centroid be the same 
> if there are no points?  pointTotal in the rehydration case (or in the 
> case of just calling new Cluster(center)) is just a vector of the same 
> cardinality as center, but with all values zero.
>
>
>
> [1]:
> Author: gsingers
> Date: Sat Jun 27 02:57:18 2009
> New Revision: 788919
>
> URL: http://svn.apache.org/viewvc?rev=788919&view=rev
> Log:
> add the center as a point
>
> Modified:
>    lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java
>
> Modified: lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java
> URL: http://svn.apache.org/viewvc/lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java?rev=788919&r1=788918&r2=788919&view=diff
> ==============================================================================
> --- lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java (original)
> +++ lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java Sat Jun 27 02:57:18 2009
> @@ -54,7 +54,9 @@
>         if (log.isInfoEnabled()) {
>           log.info("Selected: " + value.asFormatString());
>         }
> -        writer.append(new Text(key.toString()), new Cluster(value));
> +        Cluster val = new Cluster(value);
> +        val.addPoint(value);
> +        writer.append(new Text(key.toString()), val);
>         count++;
>       }
>     }
>
>

