What you have observed is correct. During the final iteration, points
are observed by each cluster and these observations are used to
calculate the new cluster center and radius. Once that center moves
less than the convergence delta from the previous center, the
iterations stop. During the subsequent classification phase, each
point is assigned to its most likely cluster, and this assignment may
not always be to the same cluster because of that final center
movement.
Decreasing the convergence delta, and thus running more iterations,
may help to resolve this problem; however, there are situations with
KMeans where the end state can oscillate between two or more very
similar clusterings. I think the only way to predictably use a
post-classification radius is to recalculate it at the end.
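Recalculating the radius after classification could look roughly like the sketch below. This is illustrative only: it uses plain Java arrays rather than Mahout's Vector API, and the class and method names are mine, not Mahout's. It applies the same sqrt(s0*s2 - s1^2)/s0 formula discussed later in this thread, but to the points as they were finally classified.

```java
import java.util.Arrays;

public class PostClassificationRadius {
    // One pass over the points classified into a single cluster:
    // accumulate running sums and derive the per-component std (the "radius").
    static double[] radius(double[][] points) {
        int dims = points[0].length;
        double s0 = points.length;          // number of points observed
        double[] s1 = new double[dims];     // component-wise sum of the points
        double[] s2 = new double[dims];     // component-wise sum of squares
        for (double[] p : points) {
            for (int d = 0; d < dims; d++) {
                s1[d] += p[d];
                s2[d] += p[d] * p[d];
            }
        }
        double[] r = new double[dims];
        for (int d = 0; d < dims; d++) {
            // same shape as computeParameters(): sqrt(s0*s2 - s1^2) / s0;
            // max(0, ...) guards against tiny negative round-off
            r[d] = Math.sqrt(Math.max(0, s0 * s2[d] - s1[d] * s1[d])) / s0;
        }
        return r;
    }

    public static void main(String[] args) {
        // Hypothetical points re-classified to one cluster after the final center move.
        double[][] cluster = {{25.0, 9.0}, {27.0, 9.0}, {23.0, 9.0}};
        // dimension 0 has spread (nonzero std); dimension 1 is constant (std 0.0)
        System.out.println(Arrays.toString(radius(cluster)));
    }
}
```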
On 5/14/13 2:19 PM, Erinn Schorsch wrote:
>
> Thanks Jeff.
>
> After some additional investigation on our side, we find the
> math/stddeviation calculation to be correct, and that our data does
> have a radius of 0 (at KMeans Cluster Identification time)... all
> points were of the same value.
>
> The problem however is that we run the KMeans Classification process
> subsequently,... and it returns a set of vectors classified to the
> cluster in question, which have different values than the set from
> ClusterIdentification time. These points are not all the same value.
> The reason for this is that, at the end of ClusterIdentification, the
> center/radius are calculated... using this new center for the
> Classification run, the measurements for each vector vary from the
> last iteration of ClusterIdentification, and produce a different
> categorization of the data,... so the radius from the final iteration
> of ClusterIdentification does not represent the stddeviation of the
> classification results.
>
> *From:*Jeff Eastman [mailto:jeastman@windwardsolutions.com]
> *Sent:* Tuesday, May 14, 2013 11:10 AM
> *To:* Erinn Schorsch
> *Cc:* user@mahout.apache.org
> *Subject:* Re: Problems with KMeans Clustering - Radius calculation
> returns incorrect ZERO value in some cases.
>
> Hi Erinn,
>
> The radius calculation in KMeans and other clustering algorithms uses
> a running-sums algorithm (see RunningSumsGaussianAccumulator), and the
> radius is really the standard deviation produced by this method. Here
> (as you likely know) s0 is the number of points observed, s1 is the
> sum of those points, and s2 is the sum of the squares of those points.
> This algorithm has some documented round-off issues, but your problem
> does not look like round-off. You have not included the points in your
> example, but if they are all the same value within a cluster then I
> would expect their std, and hence their radius, to be zero.
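To make the running-sums bookkeeping concrete: a minimal sketch in plain Java (this is not Mahout's RunningSumsGaussianAccumulator itself; the class and method names are mine) showing that identical observations drive the std, and hence the radius, to exactly zero:

```java
public class RunningSumsDemo {
    // Accumulate s0, s1, s2 in one pass and return sqrt(s0*s2 - s1^2)/s0,
    // the per-component std that KMeans reports as the radius.
    static double std(double[] points) {
        double s0 = 0, s1 = 0, s2 = 0;
        for (double x : points) {
            s0 += 1;      // number of points observed
            s1 += x;      // sum of those points
            s2 += x * x;  // sum of the squares of those points
        }
        return Math.sqrt(s0 * s2 - s1 * s1) / s0;
    }

    public static void main(String[] args) {
        // Six identical observations: s0=6, s1=150, s2=3750, so 6*3750 - 150^2 = 0.
        System.out.println(std(new double[]{25, 25, 25, 25, 25, 25})); // prints 0.0
        // The same count with any spread gives a nonzero std.
        System.out.println(std(new double[]{25, 25, 25, 25, 20, 30}));
    }
}
```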
>
> Jeff
>
> On 5/9/13 9:17 PM, Erinn Schorsch wrote:
>
> I am working on an application using mahout KMeans clustering and
> classification. We use Canopy clusters to seed KMeans although I
> don't believe this to be relevant for this issue.
>
> Has anyone else experienced this issue? (details follow). Does
> anyone have any insight into whether radius=0 affects whether KMeans
> convergence is reached (and therefore drives premature convergence
> when the radius is miscalculated as 0)?
>
> We have discovered what appears to be a defect in how the radius
> value is calculated for the clusters that Mahout/KMeans generates.
> Generally, we expect that when a cluster includes data points
> (observations) which vary from the center point, the radius should
> be some nonzero value. Our understanding is that a bigger radius
> means a larger range of values.
>
> We found cases where several data points were part of a cluster,
> yet the radius was returned as 0. We use the radius to evaluate how
> usable/relevant each cluster is for our use case, so getting an
> accurate radius is important to us.
>
> We are using Mahout version 0.7
>
> Details:
>
> When calculating the KMeans clusters, the radius is calculated by
> the method AbstractCluster.computeParameters():
>
> @Override
> public void computeParameters() {
>   if (getS0() == 0) {
>     return;
>   }
>   setNumObservations((long) getS0());
>   setTotalObservations(getTotalObservations() + getNumObservations());
>   setCenter(getS1().divide(getS0()));
>   // compute the component stds
>   if (getS0() > 1) {
>     setRadius(getS2().times(getS0()).minus(getS1().times(getS1()))
>         .assign(new SquareRootFunction()).divide(getS0()));
>   }
>   setS0(0);
>   setS1(center.like());
>   setS2(center.like());
> }
>
> The important bit:
> *setRadius(getS2().times(getS0()).minus(getS1().times(getS1())).assign(new
> SquareRootFunction()).divide(getS0()));*
>
> Or, simplified/paraphrased: (S2 * S0) minus (S1 * S1), then apply a
> SquareRootFunction and divide by S0... (these last two steps I don't
> think affect our scenario, since the difference is already zero).
>
> Data in our case:
>
> S0=6.0
>
> S1={1:150.0,0:54.0}
>
> S2={1:3750.0,0:486.0}
>
> getS2().times(getS0())={1:22500.0,0:2916.0}
>
> getS1().times(getS1())={1:22500.0,0:2916.0}
>
> And follows: ({1:22500.0,0:2916.0}).minus({1:22500.0,0:2916.0}) ={}
>
> ... Thus, we have a ZERO value for radius on this
> point/cluster.... And clearly this should not be the case. S1 and
> S2 represent 2 of the 6 "observations" (S0 is number of
> observations)... and S1 and S2 are different data points,... so
> radius must be nonzero?
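For what it's worth, the sums quoted above are exactly what six identical points would produce: the mean S1/S0 is {1:25.0, 0:9.0}, and six copies of that point reproduce both S1 and S2. A quick check (a sketch in plain Java, with the sparse vectors spelled out as arrays in [dim1, dim0] order; the class name is mine):

```java
import java.util.Arrays;

public class CheckSums {
    // Build S1 (sum) and S2 (sum of squares) from n copies of a single point.
    static double[][] sumsOfIdenticalPoints(double[] point, int n) {
        double[] s1 = new double[point.length];
        double[] s2 = new double[point.length];
        for (int i = 0; i < n; i++) {
            for (int d = 0; d < point.length; d++) {
                s1[d] += point[d];
                s2[d] += point[d] * point[d];
            }
        }
        return new double[][]{s1, s2};
    }

    public static void main(String[] args) {
        // Six copies of the mean point S1/S0 = {1:25.0, 0:9.0}.
        double[][] sums = sumsOfIdenticalPoints(new double[]{25.0, 9.0}, 6);
        System.out.println("S1 = " + Arrays.toString(sums[0])); // [150.0, 54.0]  -> {1:150.0,0:54.0}
        System.out.println("S2 = " + Arrays.toString(sums[1])); // [3750.0, 486.0] -> {1:3750.0,0:486.0}
    }
}
```

Since the sums match the post exactly, the data is consistent with all six points having the same value, in which case a zero radius is what the formula should give.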
>
> Is this a defect? Is there a flaw in our understanding/expectation
> of radius?
>
> It seems the algorithm for radius here equates basically to: (A *
> B) - (C * C)
>
> It is easy to imagine many mathematical combinations where this
> equation will compute to ZERO. Can this be correct behavior?
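It may help to note that, per component, S0*S2 - S1^2 equals S0^2 times the population variance, so it can only be zero when every point in the cluster has the same value in that component. A quick numeric check of that identity (a sketch, not Mahout code; names are mine):

```java
public class VarianceIdentity {
    // Returns {s0*s2 - s1*s1, s0*s0*variance} for a 1-D sample; the two are equal.
    static double[] bothSides(double[] xs) {
        double s0 = xs.length, s1 = 0, s2 = 0;
        for (double x : xs) { s1 += x; s2 += x * x; }
        double mean = s1 / s0, var = 0;
        for (double x : xs) var += (x - mean) * (x - mean);
        var /= s0;  // population variance
        return new double[]{s0 * s2 - s1 * s1, s0 * s0 * var};
    }

    public static void main(String[] args) {
        // s0=4, s1=16, s2=82: left side 4*82 - 256 = 72; variance 4.5, so 16*4.5 = 72.
        double[] sides = bothSides(new double[]{1.0, 4.0, 4.0, 7.0});
        System.out.println(sides[0] + " == " + sides[1]); // 72.0 == 72.0
    }
}
```

So the expression going to zero is not an arbitrary coincidence of A, B, and C; it is the variance vanishing, which happens exactly when the points are identical.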
>
> Thanks for any thoughts / input!
>
