mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ted Dunning <>
Subject Re: kMeans Help
Date Sat, 27 Jun 2009 17:22:22 GMT
Another way to say it that subsumes several different techniques is that
there is an prior expectation of the centroid.  The prior and the observed
points participate in computing the new centroid according to weights
associated with the prior and points.  The difference between different
techniques is in the weighting.

For k-means, we are doing maximum likelihood estimation with hard cluster
membership  and the weighting is infinitessimal on the prior and equal on
all the observed points.  The centroid with not observed points is
determined by the prior, but as soon as there are any observed points, the
centroid depends only on those points and not on the prior.

With Bayesian clustering such as the DP stuff with Gaussian clusters, the
prior has significant weight (that weight is a parameter of the algorithm)
and the weight is determined according to how well the clusters describe
where the point is (in a probabilistic sense).  The centroid is a weighted
average of the prior and the points according to these weights.

On Sat, Jun 27, 2009 at 8:42 AM, Jeff Eastman <>wrote:

> I think this comment is on the right track. During an iteration, each
> cluster is created with a center and no points. Then, as each point is
> compared against the cluster centers, it is added to the closest cluster. If
> the initial center is considered to be a point, then it will bias the new
> centroid calculation towards its center, incorrectly, as shown below.
> One could argue that the centroid of a degenerate cluster with no points
> ought to be its center and not a zero vector, but clusters with points
> should have centroids that do not include it.
> nfantone wrote:
>> On Sat, Jun 27, 2009 at 8:10 AM, Grant Ingersoll<>
>> wrote:
>>> On Jun 26, 2009, at 10:42 PM, Grant Ingersoll wrote:
>>>> The semantics of constructing a Cluster are odd to me.  Do I always have
>>>> to immediately add a point to the Cluster in order for it to be "real",
>>>> despite the fact that I added a Center?  Isn't adding a Center
>>>> effectively
>>>> giving the Cluster one point?
>> Perhaps I misunderstood you, but I think that by assigning a new point
>> (by calling addPoint(Vector)) to a Cluster does not mean you are
>> "adding a center". A center is specified at the beginning of the
>> algorithm and every iteration, after including a set of new points,
>> recalculates that center by determining a new means - which is now the
>> centroid of that particular Cluster. So, clearly, the center itself is
>> a proper point in the Cluster and you don't need to add it after being
>> selected as that in order for it to be "real".
>>> And if you add the center, why isn't it the centroid until other points
>>> are
>>> added?
>> Again, the centroid is the result of a recalculation of a means and
>> may or may not be a real point. By having just one point in a Cluster
>> - that is to say, its center - there's no "recalculation" to be done.
>> Conceptually, you could say the centroid lies, in fact, in the center
>> - though, it's not relevant to the algorithm.
>> A final example. Let's say you create a Cluster C with point (1,1) as
>> its center. Then, you add (3,3) to it.
>> Cluster C: (1,1);(3,3) - original center: (1,1) - centroid: (2,2)
>> Now, you create another Cluster C' with the same center, but decide to
>> add the point again. Then, (3,3) is added.
>> Cluster C': (1,1);(1,1);(3,3) - original center: (1,1) - centroid (5/3,
>> 5/3).
>> Ok, that was an unnecesary example. Got it. But it shows that C and C'
>> are not the same cluster, based on the fact that point repetition
>> contribute to a general means.

Ted Dunning, CTO

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
858-414-0013 (m)
408-773-0220 (fax)

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message