mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Cluster-center and cluster-radius
Date Tue, 26 Jul 2011 15:22:08 GMT
The first problem is that the two input coordinates don't have comparable
variability. This means that Euclidean distance is going to be pretty much
just y-distance.

One way to improve this is to rescale each coordinate by dividing it by the
standard deviation of that coordinate.
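
As a rough illustration of that rescaling, here is a minimal Python sketch
using the ten points from the quoted message below. The helper name
`standardize` and the choice of the sample standard deviation
(`statistics.stdev`) are assumptions made for illustration; neither comes
from Mahout or from the R session.

```python
from statistics import stdev

# The ten (x, y) points from the original question.
points = [
    (13.0, 4455.0), (13.0, 5101.0), (13.0, 333.0), (13.0, 3412.0),
    (13.0, 823.0), (13.0, 238.0), (13.0, 951.0), (9.0, 311.0),
    (9.0, 970.0), (10.0, 2885.0),
]

def standardize(pts):
    """Divide each coordinate by that coordinate's sample standard deviation.

    Hypothetical helper for illustration only.
    """
    columns = list(zip(*pts))
    scales = [stdev(col) for col in columns]
    scaled = [tuple(v / s for v, s in zip(p, scales)) for p in pts]
    return scaled, scales

scaled_points, scales = standardize(points)
print("per-coordinate standard deviations:", scales)
print("first scaled point:", scaled_points[0])
```

After this step both coordinates have unit standard deviation, so neither
one dominates a Euclidean distance purely because of its units.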

Depending on what your y coordinate is intended to mean, you might consider
a log transform of y.
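
To see what the log transform does to the spread of the data, a small
Python sketch along these lines may help. It assumes the natural logarithm
and uses the population standard deviation (`statistics.pstdev`) as a
simple measure of spread; both choices are illustrative.

```python
import math
from statistics import pstdev

# Coordinates of the ten points from the original question.
xs = [13.0, 13.0, 13.0, 13.0, 13.0, 13.0, 13.0, 9.0, 9.0, 10.0]
ys = [4455.0, 5101.0, 333.0, 3412.0, 823.0, 238.0, 951.0, 311.0, 970.0, 2885.0]

# Natural log of y; this assumes every y is strictly positive.
log_ys = [math.log(y) for y in ys]

print("spread of x:     ", pstdev(xs))
print("spread of y:     ", pstdev(ys))
print("spread of log(y):", pstdev(log_ys))
```

For these points the raw y spread is far larger than the x spread, while
the spread of log(y) ends up somewhat smaller than that of x.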

I pulled these points into R and used k-means on the data.  The centroids
that I got were exactly the same as yours so the Mahout clustering appears
to be working well.

The cluster results are:

K-means clustering with 2 clusters of sizes 6, 4

Cluster means:
         x         y
1 11.66667  604.3333
2 12.25000 3963.2500

Clustering vector:
 [1] 2 2 1 2 1 1 1 1 1 2
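
For anyone who wants to check these numbers without R or Mahout, the
following is a minimal sketch of plain Lloyd-style k-means in Python. The
function name `kmeans`, the fixed starting centers, and the stopping rule
(assignments no longer change) are assumptions for illustration; they are
not the R or Mahout implementation, and the original run used two randomly
chosen points as initial clusters.

```python
import math

# The ten (x, y) points from the original question.
points = [
    (13.0, 4455.0), (13.0, 5101.0), (13.0, 333.0), (13.0, 3412.0),
    (13.0, 823.0), (13.0, 238.0), (13.0, 951.0), (9.0, 311.0),
    (9.0, 970.0), (10.0, 2885.0),
]

def kmeans(pts, initial_centers, max_iter=10):
    """Plain Lloyd iteration with Euclidean distance (illustrative sketch)."""
    centers = list(initial_centers)
    assignment = None
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        new_assignment = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in pts
        ]
        if new_assignment == assignment:
            break  # assignments stopped changing
        assignment = new_assignment
        # Move each center to the mean of the points assigned to it.
        for k in range(len(centers)):
            members = [p for p, a in zip(pts, assignment) if a == k]
            if members:
                centers[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignment

# Assumed starting centers: the third and the first input point.
centers, assignment = kmeans(points, [points[2], points[0]])
print("centers:", centers)
print("assignment (0-based):", assignment)
```

With these starting centers the sketch settles on the same two centroids
as above; other starting points are not guaranteed to do so.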


If you use a log-transform of y, then these are the cluster results:

K-means clustering with 2 clusters of sizes 3, 7

Cluster means:
          x     logy
1  9.333333 6.861456
2 13.000000 7.132130

Clustering vector:
 [1] 2 2 2 2 2 2 2 1 1 1


Neither of these is very satisfying.  The untransformed version separates
only on y while the transformed version separates only on x.
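
On the radius question in the quoted message below: the reported r values
coincide numerically with the per-coordinate population standard deviation
of each cluster's points. Whether that is how Mahout defines r is an
assumption here, since nothing in this thread confirms it, but if so then
points lying outside c ± r are to be expected. A minimal Python sketch of
the check for VL-0, using the cluster membership from the clustering vector
above:

```python
from statistics import mean, pstdev

# Members of VL-0 (cluster 1 in the R output), from the clustering vector.
cluster0 = [(13.0, 333.0), (13.0, 823.0), (13.0, 238.0),
            (13.0, 951.0), (9.0, 311.0), (9.0, 970.0)]

# Per-coordinate mean and population standard deviation.
center = [mean(col) for col in zip(*cluster0)]
radius = [pstdev(col) for col in zip(*cluster0)]
print("center:", [round(v, 3) for v in center])
print("radius:", [round(v, 3) for v in radius])

# Points more than one standard deviation from the center in some coordinate.
outside = [p for p in cluster0
           if any(abs(v - c) > r for v, c, r in zip(p, center, radius))]
print("outside c ± r:", outside)
```

Under that reading, r describes the typical spread of a cluster rather
than a bound that every member must fall within.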

On Tue, Jul 26, 2011 at 6:05 AM, Immo Micus <immomicus@googlemail.com> wrote:

> Hello,
>
> This is my first email to the mahout-user list.
> I am trying to do some clustering with Mahout and I have a question
> concerning the cluster center and cluster radius.
>
> For testing, I clustered 10 points using the KMeansClusterer:
>
> points:
>  [13.000, 4455.000]
>  [13.000, 5101.000]
>  [13.000,   333.000]
>  [13.000, 3412.000]
>  [13.000,   823.000]
>  [13.000,   238.000]
>  [13.000,   951.000]
>  [  9.000,   311.000]
>  [  9.000,   970.000]
>  [10.000, 2885.000]
>
> This is the method I am using:
>
> clusters = KMeansClusterer.clusterPoints(points, initial_clusters, measure,
> 10, 0.001);
>
> initial_clusters are 2 random points of the points above, measure is
> EuclideanDistanceMeasure.
>
>
> And this is the result of the converged clusters VL-0 and VL-1:
>
> VL-0{n=6 c=[11.667, 604.333] r=[1.886, 315.059]}
> VL-1{n=4 c=[12.250, 3963.250] r=[1.299, 866.428]}
>
> If I understand this output correctly, n is the number of points assigned
> to the cluster, c is the cluster center and r is the cluster radius.
> So every point belongs to either cluster 0 or cluster 1. You can even
> guess which points belong to which cluster, but I am confused by the
> calculated cluster center and cluster radius.
> For example, [9.000, 970.000] should belong to cluster 0, but
> 9.000 < 9.781 (11.667 - 1.886) and 970.000 > 919.392 (604.333 + 315.059).
> The point is not within range of cluster 0 and obviously does not belong
> to cluster 1, yet all 10 points are assigned to clusters. Can someone
> please tell me where the mistake is?
>
>
> greetings, Immo