mahout-user mailing list archives

From Jeff Eastman <j...@windwardsolutions.com>
Subject Re: kMeans Help
Date Mon, 29 Jun 2009 20:47:19 GMT
Changing an empty cluster to return its center as the centroid fixes a
bug in the convergence calculation and causes convergence to happen
earlier. Because an empty cluster returned a zero centroid vector
instead of its center, the convergence test marked empty clusters as
not converged. The fix changes the outcome of the clustering, so I
changed expectedNumPoints[2] to {4,4,1}, and the test now passes.
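
A minimal sketch of the effect described above, not the actual Mahout
code. SketchCluster, centroidOld, centroidNew and converged are
hypothetical names, plain double[] arrays stand in for Mahout's Vector,
and the convergence test is assumed to be "Euclidean distance between
center and centroid is at most delta".

```java
import java.util.Arrays;

// Hypothetical, simplified stand-in for a k-means cluster (not Mahout's Cluster class).
class SketchCluster {
    final double[] center;      // prior expectation of the centroid
    final double[] pointTotal;  // running sum of added points, starts as all zeros
    int numPoints = 0;

    SketchCluster(double[] center) {
        this.center = center.clone();
        this.pointTotal = new double[center.length];
    }

    void addPoint(double[] p) {
        for (int i = 0; i < p.length; i++) {
            pointTotal[i] += p[i];
        }
        numPoints++;
    }

    // Behavior under discussion: an empty cluster reports pointTotal, a zero vector.
    double[] centroidOld() {
        return numPoints == 0 ? pointTotal.clone() : mean();
    }

    // Proposed behavior: an empty cluster reports its center.
    double[] centroidNew() {
        return numPoints == 0 ? center.clone() : mean();
    }

    private double[] mean() {
        double[] c = new double[pointTotal.length];
        for (int i = 0; i < c.length; i++) {
            c[i] = pointTotal[i] / numPoints;
        }
        return c;
    }

    // Assumed convergence test: the centroid lies within delta of the center.
    static boolean converged(double[] center, double[] centroid, double delta) {
        double sum = 0.0;
        for (int i = 0; i < center.length; i++) {
            double d = center[i] - centroid[i];
            sum += d * d;
        }
        return Math.sqrt(sum) <= delta;
    }

    public static void main(String[] args) {
        SketchCluster empty = new SketchCluster(new double[] {3.0, 4.0});
        System.out.println("old centroid: " + Arrays.toString(empty.centroidOld()));
        System.out.println("new centroid: " + Arrays.toString(empty.centroidNew()));
        System.out.println("old converged: " + converged(empty.center, empty.centroidOld(), 0.5));
        System.out.println("new converged: " + converged(empty.center, empty.centroidNew(), 0.5));
    }
}
```

With a center of (3, 4) and no points, the old variant sits a distance
of 5 from the center and is reported as not converged, while the new
variant sits at distance 0 and is reported as converged. Once a cluster
has points, both variants return the same mean.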



Grant Ingersoll wrote:
> FYI, if I make this change the only test that fails is 
> TestKmeansClustering#testReferenceImplementation.
>
> See MAHOUT-141
>
> On Jun 29, 2009, at 12:07 PM, Jeff Eastman wrote:
>
>> I have no problem with returning center as the centroid for a cluster 
>> with no points. From Ted's earlier discussion, the center is the 
>> prior expectation of the centroid and returning a zero vector is just 
>> a bug that has not made itself apparent until now.
>>
>> I also agree that serializing and then deserializing a cluster (or 
>> any object for that matter) should not alter its state.
>>
>>
>> Grant Ingersoll wrote:
>>>
>>> On Jun 28, 2009, at 5:55 PM, Grant Ingersoll wrote:
>>>
>>>>
>>>> On Jun 28, 2009, at 4:56 PM, Grant Ingersoll wrote:
>>>>
>>>>> I get all of this. My point is that when you rehydrate the 
>>>>> Cluster, it doesn't properly report the centroid (per my email), 
>>>>> because numPoints == 0 and pointTotal is a vector of the same 
>>>>> cardinality as the passed-in center vector, but initialized to 0.
>>>>>
>>>>
>>>> In other words, the simple act of serializing a Cluster to HDFS and 
>>>> then reconstituting it should not alter the result one gets, which 
>>>> I believe is what happens if one dumps out the clusters that have 
>>>> been calculated after the whole process is done.
>>>
>>> [1] is what I had to do to work around it for the Random approach, 
>>> but I think it isn't the right approach.
>>>
>>> I think the problem lies in computeCentroid:
>>> private Vector computeCentroid() {
>>>    if (numPoints == 0)
>>>      return pointTotal;
>>>    else if (centroid == null) {
>>>      // lazy compute new centroid
>>>      centroid = pointTotal.divide(numPoints);
>>>      Vector stds = pointSquaredTotal.times(numPoints).minus(
>>>          pointTotal.times(pointTotal)).assign(new SquareRootFunction())
>>>          .divide(numPoints);
>>>      std = stds.zSum() / 2;
>>>    }
>>>    return centroid;
>>>  }
>>>
>>> I don't understand why, if numPoints == 0, the next line isn't just: 
>>> return center;  Why wouldn't the center and the centroid be the same 
>>> if there are no points?  pointTotal in the rehydration case (or in 
>>> the case of just calling new Cluster(center)) is just a vector of the 
>>> same cardinality as center, but with all values zero.
>>>
>>>
>>>
>>> [1]:
>>> Author: gsingers
>>> Date: Sat Jun 27 02:57:18 2009
>>> New Revision: 788919
>>>
>>> URL: http://svn.apache.org/viewvc?rev=788919&view=rev
>>> Log:
>>> add the center as a point
>>>
>>> Modified:
>>>   lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java
>>>
>>> Modified: lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java
>>> URL: http://svn.apache.org/viewvc/lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java?rev=788919&r1=788918&r2=788919&view=diff
>>> ==============================================================================
>>> --- lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java (original)
>>> +++ lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java Sat Jun 27 02:57:18 2009
>>> @@ -54,7 +54,9 @@
>>>        if (log.isInfoEnabled()) {
>>>          log.info("Selected: " + value.asFormatString());
>>>        }
>>> -        writer.append(new Text(key.toString()), new Cluster(value));
>>> +        Cluster val = new Cluster(value);
>>> +        val.addPoint(value);
>>> +        writer.append(new Text(key.toString()), val);
>>>        count++;
>>>      }
>>>    }
>>>
>>>
>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) 
> using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>
>
