The first problem is that the two input coordinates don't have comparable
variability: x only ranges from 9 to 13, while y ranges from 238 to 5101.
This means that the Euclidean distance is going to be pretty much just the
distance in y.
One way to improve this is to rescale each coordinate by dividing by the
standard deviation of that coordinate.
Depending on what your y coordinate is intended to mean, you might also
consider a log transform of y.
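
As a rough illustration of the two rescaling options above, here is a minimal Python sketch using only the standard library. The helper names `scale_by_sd` and `log_y` are made up for this example, the sample (n - 1) standard deviation is an assumption, and the log is taken to be the natural log.

```python
import math
import statistics

# The ten (x, y) points from the original question.
points = [
    (13.0, 4455.0), (13.0, 5101.0), (13.0, 333.0), (13.0, 3412.0),
    (13.0, 823.0), (13.0, 238.0), (13.0, 951.0), (9.0, 311.0),
    (9.0, 970.0), (10.0, 2885.0),
]

def scale_by_sd(rows):
    """Divide every coordinate by the sample standard deviation of its column."""
    columns = list(zip(*rows))
    sds = [statistics.stdev(col) for col in columns]
    return [tuple(v / sd for v, sd in zip(row, sds)) for row in rows]

def log_y(rows):
    """Replace y with its natural log; x is left untouched."""
    return [(x, math.log(y)) for x, y in rows]

scaled = scale_by_sd(points)
logged = log_y(points)

# After scaling, each column has unit standard deviation, so neither
# coordinate dominates the Euclidean distance by magnitude alone.
print([round(statistics.stdev(col), 6) for col in zip(*scaled)])
```

Either transformed list can then be handed to the clusterer in place of the raw points.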
I pulled these points into R and used kmeans on the data. The centroids
that I got were exactly the same as yours so the Mahout clustering appears
to be working well.
The cluster results are:
K-means clustering with 2 clusters of sizes 6, 4

Cluster means:
         x         y
1 11.66667  604.3333
2 12.25000 3963.2500

Clustering vector:
 [1] 2 2 1 2 1 1 1 1 1 2
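
For anyone without R at hand, the same centroids can be reproduced with a few lines of plain Python. This is only a sketch of standard Lloyd iterations with Euclidean distance, not Mahout's or R's actual implementation. The function name `kmeans` and the choice of the first and third points as initial centers are assumptions for the example (the original run used two random points), and the cluster numbering is arbitrary, so cluster 0 here corresponds to R's cluster 2.

```python
import math

# The ten (x, y) points from the original question.
points = [
    (13.0, 4455.0), (13.0, 5101.0), (13.0, 333.0), (13.0, 3412.0),
    (13.0, 823.0), (13.0, 238.0), (13.0, 951.0), (9.0, 311.0),
    (9.0, 970.0), (10.0, 2885.0),
]

def kmeans(rows, centers, max_iter=10):
    """Lloyd iterations; stops early once the assignments no longer change."""
    labels = None
    for _ in range(max_iter):
        # Assign each point to its nearest center (Euclidean distance).
        new_labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in rows
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Move each center to the mean of the points assigned to it.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(rows, labels) if lab == k]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(old)  # leave an empty cluster's center in place
        centers = new_centers
    return centers, labels

# Assumed starting centers: the first and third input points.
centers, labels = kmeans(points, [points[0], points[2]])
print(centers)
print(labels)
```

A different pair of starting points can in principle converge to a different local optimum, so this checks the reported result rather than proving it is the only one.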
If you use a log transform of y, then these are the cluster results:
K-means clustering with 2 clusters of sizes 3, 7

Cluster means:
          x     logy
1  9.333333 6.861456
2 13.000000 7.132130

Clustering vector:
 [1] 2 2 2 2 2 2 2 1 1 1
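
The cluster means for the log-transformed run can be rechecked directly from the clustering vector. The sketch below assumes the natural log was used and simply averages x and ln(y) per cluster; the helper name `cluster_means` is made up for this example.

```python
import math

# The ten (x, y) points from the original question.
points = [
    (13.0, 4455.0), (13.0, 5101.0), (13.0, 333.0), (13.0, 3412.0),
    (13.0, 823.0), (13.0, 238.0), (13.0, 951.0), (9.0, 311.0),
    (9.0, 970.0), (10.0, 2885.0),
]

# Clustering vector reported for the log-transformed run (1-based cluster ids).
assignment = [2, 2, 2, 2, 2, 2, 2, 1, 1, 1]

def cluster_means(rows, labels):
    """Mean of x and of ln(y) for each cluster id."""
    means = {}
    for k in sorted(set(labels)):
        members = [
            (x, math.log(y)) for (x, y), lab in zip(rows, labels) if lab == k
        ]
        means[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return means

means = cluster_means(points, assignment)
for k, (mean_x, mean_logy) in means.items():
    print(k, round(mean_x, 6), round(mean_logy, 6))
```

Cluster 1 holds exactly the three points with x of 9 or 10, and cluster 2 holds the seven points with x = 13.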
Neither of these is very satisfying. The untransformed version separates
only on y while the transformed version separates only on x.
On Tue, Jul 26, 2011 at 6:05 AM, Immo Micus <immomicus@googlemail.com> wrote:
> Hello,
>
> this is my first email to the Mahout user list.
> I am trying to do some clustering with Mahout and I have a question
> concerning the cluster center and cluster radius.
>
> For testing, I clustered 10 points using the KMeansClusterer:
>
> points:
> [13.000, 4455.000]
> [13.000, 5101.000]
> [13.000, 333.000]
> [13.000, 3412.000]
> [13.000, 823.000]
> [13.000, 238.000]
> [13.000, 951.000]
> [ 9.000, 311.000]
> [ 9.000, 970.000]
> [10.000, 2885.000]
>
> This is the method I am using:
>
> clusters = KMeansClusterer.clusterPoints(points, initial_clusters, measure,
> 10, 0.001);
>
> initial_clusters are 2 random points of the points above, measure is
> EuclideanDistanceMeasure.
>
>
> And this is the result of the converged clusters VL0 and VL1:
>
> VL0{n=6 c=[11.667, 604.333] r=[1.886, 315.059]}
> VL1{n=4 c=[12.250, 3963.250] r=[1.299, 866.428]}
>
> If I understand this output correctly, n is the number of points
> assigned to the cluster, c is the cluster center, and r is the radius of the
> cluster.
> So every point belongs to either cluster 0 or cluster 1. You can
> even guess which points belong to which cluster, but I am confused by the
> calculated cluster center and cluster radius.
> For example, [ 9.000, 970.000] should belong to cluster 0, but 9.000 <
> 9.781 [11.667 - 1.886] and 970.000 > 919.392 [604.333 + 315.059]. The
> point is not within range of the cluster and it obviously does not belong to
> cluster 1, yet all 10 points are assigned to clusters. Can someone please
> tell me where the mistake is?
>
>
> greetings, Immo
