If a single point has been added to a cluster then the centroid should
become that point, not some average of the center and the point. The
center is the prior expectation of the centroid only. Once a point is
observed then the centroid has a posterior value that is (for kmeans, at
least) independent of the prior.
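
A minimal sketch of the behavior I have in mind, under stated assumptions: SimpleCluster is a hypothetical, simplified stand-in (plain double[] arrays, no std computation), not the actual Cluster class. Returning the center when nothing has been observed is the change Grant suggests in the quoted message below; once any point has been added, the centroid depends only on the observed points.

```java
import java.util.Arrays;

/** Hypothetical, simplified stand-in for Cluster; not the Mahout class. */
final class SimpleCluster {
  private final double[] center;      // prior expectation of the centroid
  private final double[] pointTotal;  // running sum of the observed points
  private int numPoints;              // number of observed points

  SimpleCluster(double[] center) {
    this.center = center.clone();
    this.pointTotal = new double[center.length]; // all zeros initially
  }

  void addPoint(double[] point) {
    for (int i = 0; i < pointTotal.length; i++) {
      pointTotal[i] += point[i];
    }
    numPoints++;
  }

  double[] computeCentroid() {
    if (numPoints == 0) {
      // Nothing observed yet: fall back to the prior (the center).
      return center.clone();
    }
    // Posterior centroid: mean of the observed points, independent of center.
    double[] centroid = new double[pointTotal.length];
    for (int i = 0; i < centroid.length; i++) {
      centroid[i] = pointTotal[i] / numPoints;
    }
    return centroid;
  }
}

final class CentroidSketch {
  public static void main(String[] args) {
    SimpleCluster cluster = new SimpleCluster(new double[] {1.0, 1.0});
    System.out.println(Arrays.toString(cluster.computeCentroid())); // [1.0, 1.0]
    cluster.addPoint(new double[] {3.0, 5.0});
    System.out.println(Arrays.toString(cluster.computeCentroid())); // [3.0, 5.0]
    cluster.addPoint(new double[] {5.0, 7.0});
    System.out.println(Arrays.toString(cluster.computeCentroid())); // [4.0, 6.0]
  }
}
```

With one point added, the centroid is exactly that point; the center no longer contributes.
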
Grant is correct that storing the points does not scale. We do have
VisibleCluster which can be used for debugging purposes but the
assignment of points to clusters is properly the last, optional, step of
the job. (The current Mean Shift implementation uses a visible cluster
approach, since point membership cannot be deduced from knowledge of the
final clustering state alone. There is more work to be done here to make
this implementation really scale, and I'm looking at ways to output
additional information that would allow cluster membership to be
computed at the end, as with canopy and kmeans.)
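
To illustrate why kmeans does not need to store points in the cluster, here is a rough sketch of that last, optional step: each point is assigned to its nearest final centroid in a separate pass. The class and method names are hypothetical, and squared Euclidean distance is an assumption; the real jobs may use a different measure.

```java
/** Hypothetical sketch of a final assignment pass; not Mahout code. */
final class AssignmentSketch {

  /** Squared Euclidean distance between two vectors of equal length. */
  static double squaredDistance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }

  /** Returns the index of the centroid closest to the given point. */
  static int nearestCluster(double[] point, double[][] centroids) {
    int best = 0;
    double bestDistance = Double.POSITIVE_INFINITY;
    for (int k = 0; k < centroids.length; k++) {
      double distance = squaredDistance(point, centroids[k]);
      if (distance < bestDistance) {
        bestDistance = distance;
        best = k;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    // Final clustering state: only the centroids are kept.
    double[][] centroids = {{1.0, 1.0}, {5.0, 5.0}};
    double[][] points = {{0.9, 1.1}, {4.0, 6.0}, {2.0, 2.0}};
    for (double[] point : points) {
      System.out.println(nearestCluster(point, centroids));
    }
  }
}
```

Only the centroids have to survive the clustering iterations; membership is recomputed from the input when it is actually wanted.
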
nfantone wrote:
> I really see no harm (algorithmically and conceptually) in returning
> the center as the centroid if there's only one point added to the
> cluster. If that's what you need to solve your predicament, I say go
> for it. Are there any drawbacks?
>
> What eludes me is the actual way of adding points. How can I compute
> its total set at any given moment? Say I create a Cluster with a
> center, then add some points (addPoint() just stores a pointTotal
> Vector with the total vector sum), and want to check which vectors I
> have added so far with their original values. Is this even possible?
>
> On Mon, Jun 29, 2009 at 9:42 AM, Grant Ingersoll<gsingers@apache.org> wrote:
>
>> On Jun 28, 2009, at 5:55 PM, Grant Ingersoll wrote:
>>
>>
>>> On Jun 28, 2009, at 4:56 PM, Grant Ingersoll wrote:
>>>
>>>
>>>> I get all of this, my point is that when you rehydrate the Cluster, it
>>>> doesn't properly report the centroid per my email, all because
>>>> numPoints == 0 and pointTotal is a vector that is the same as the
>>>> passed in center vector, but initialized to 0.
>>>>
>>>>
>>> In other words, the simple act of serializing a Cluster to HDFS and then
>>> reconstituting it should not alter the result one gets, which I believe is
>>> what happens if one dumps out the clusters that have been calculated after
>>> the whole process is done.
>>>
>> [1] is what I had to do to work around it for the Random approach, but I
>> think it isn't the right approach.
>>
>> I think the problem lies in computeCentroid:
>>   private Vector computeCentroid() {
>>     if (numPoints == 0)
>>       return pointTotal;
>>     else if (centroid == null) {
>>       // lazy compute new centroid
>>       centroid = pointTotal.divide(numPoints);
>>       Vector stds = pointSquaredTotal.times(numPoints).minus(
>>           pointTotal.times(pointTotal)).assign(new SquareRootFunction())
>>           .divide(numPoints);
>>       std = stds.zSum() / 2;
>>     }
>>     return centroid;
>>   }
>>
>> I don't understand why, if numPoints == 0, the next line isn't just:
>> return center; Why wouldn't the center and the centroid be the same if
>> there are no points? pointTotal in the rehydration case (or in the case
>> of just calling new Cluster(center)) is just a vector of the same
>> cardinality as center, but all values are zero.
>>
>>
>>
>> [1]:
>> Author: gsingers
>> Date: Sat Jun 27 02:57:18 2009
>> New Revision: 788919
>>
>> URL: http://svn.apache.org/viewvc?rev=788919&view=rev
>> Log:
>> add the center as a point
>>
>> Modified:
>>
>> lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java
>>
>> Modified:
>> lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java
>> URL:
>> http://svn.apache.org/viewvc/lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java?rev=788919&r1=788918&r2=788919&view=diff
>> ==============================================================================
>> ---
>> lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java
>> (original)
>> +++
>> lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java
>> Sat Jun 27 02:57:18 2009
>> @@ -54,7 +54,9 @@
>> if (log.isInfoEnabled()) {
>> log.info("Selected: " + value.asFormatString());
>> }
>> - writer.append(new Text(key.toString()), new Cluster(value));
>> + Cluster val = new Cluster(value);
>> + val.addPoint(value);
>> + writer.append(new Text(key.toString()), val);
>> count++;
>> }
>> }
>>
>>
>>
>
>
>
