mahout-user mailing list archives

From Lance Norskog <goks...@gmail.com>
Subject Clusters and radii interpretation
Date Sun, 07 Nov 2010 00:29:41 GMT
I have a dataset of vectors in 150 dimensions, and I'm experimenting with 
clustering. The vectors should be correlated in some way, so they should 
be somewhat clusterable. Every component lies in 0.0 <= x <= 1.0. The 
norm2 scale factor for the space is 1/sqrt(dimensions).
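The 1/sqrt(dimensions) factor can be checked with a minimal Python sketch. It assumes every component lies in 0.0 <= x <= 1.0, so the largest possible norm2 is sqrt(dimensions), reached by the all-ones vector. The function name `normalized_norm2` is hypothetical and is not a Mahout call.

```python
import math

def normalized_norm2(vec):
    """2-norm of vec scaled by 1/sqrt(len(vec)).

    Assumes each component lies in [0.0, 1.0], so the result is also in [0.0, 1.0].
    """
    if not vec:
        raise ValueError("vector must be non-empty")
    norm2 = math.sqrt(sum(x * x for x in vec))
    return norm2 / math.sqrt(len(vec))

dims = 150
# The all-ones corner of the space has the largest norm2, sqrt(dims).
print(normalized_norm2([1.0] * dims))
# A point halfway along every axis.
print(normalized_norm2([0.5] * dims))
```

With this scaling, the norm of any vector in the space can be read as a fraction of the longest possible vector.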

KMeans/FuzzyKMeans did not work at all. Dirichlet works with an 
AsymmetricSampledNormalDistribution. It stops after 24 iterations but 
will give as many clusters as requested. (I don't know if this is expected.)

To evaluate these clusters, I am examining the radius of each cluster. 
The radius is a vector holding one distance per dimension for the cluster 
vector. I scale these into the 0 -> 1 range with the above norm2, purely 
to suit my own limited mathematical intuition.

The results:
These radii, both in Canopy and Dirichlet, are all less than 1.0. Good 
first step. Since KMeans doesn't work, the clusters are probably 
asymmetric. The radii all have different norms. The 7 Canopy radii are, 
in order, 5 roughly equal radii, one small and one near-zero, showing how 
Canopy closes in. The Dirichlet output is a different kettle of fish. 
First, all of the radii have several negative values. I had assumed that 
the radius values would all be positive, so I assume this is a loose end 
in the Dirichlet implementation. I normalized them by subtracting the 
lowest (most negative) value from every component, which is why all of 
them have a minimum value of 0.0.
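The shift applied to the Dirichlet radii can be sketched as follows. This is a minimal illustration, not the code actually used: it subtracts the most negative component (equivalently, adds its absolute value) so that the smallest entry becomes 0.0. The function name `shift_to_nonnegative` and the example values are made up.

```python
def shift_to_nonnegative(radius):
    """Subtract the smallest component if it is negative, so the minimum becomes 0.0."""
    lowest = min(radius)
    if lowest >= 0.0:
        # Nothing to fix: return an unchanged copy.
        return list(radius)
    return [x - lowest for x in radius]

example = [-0.2, 0.1, 0.3]  # made-up values, not taken from the dataset
print(shift_to_nonnegative(example))
```

A constant shift like this leaves the stddev of the components unchanged but does change min, max and norm2, so those three values describe the shifted vector rather than the original radius.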

Here are the Canopy and Dirichlet radius summaries. Min/Max/Norm come 
from the Vector implementation functions. Stddev is from the 
StandardDeviation class. Min/Max show the maximum skew of the radius 
oval, and the norm2 is a measure of the N-dimensional size of the oval.
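For readers without Mahout at hand, the four summary values can be reproduced with a short Python sketch. It is only an approximation of the setup described above: it uses plain lists instead of the Vector implementation, and it assumes the sample (n - 1) form of the standard deviation, which may or may not match how the StandardDeviation class was configured. The name `summarize_radius` and the input values are illustrative.

```python
import math
import statistics

def summarize_radius(radius):
    """Return min, max, 2-norm and sample standard deviation of a radius vector."""
    return {
        "min": min(radius),
        "max": max(radius),
        "norm2": math.sqrt(sum(x * x for x in radius)),
        # Sample (n - 1) standard deviation; this choice is an assumption.
        "stddev": statistics.stdev(radius),
    }

radius = [0.0, 0.3, 0.4]  # illustrative values only
summary = summarize_radius(radius)
print("radius min: {min}, max: {max}, norm2: {norm2}, stddev: {stddev}".format(**summary))
```

Note that `statistics.stdev` needs at least two components, which is not a concern for 150-dimensional radii.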

Interpretation of Canopy: the norm2 values of 0.07 to 0.7 indicate very 
small to very large ovals. The stddev values indicate a similarly wide 
range, from rounded to extreme ovals.

Interpretation of Dirichlet: the norm2 values are from 0.18 to 0.30. The 
stddev values are in a similarly narrow range. Thus, Dirichlet was much 
better at finding good clusters.
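The comparison of norm2 ranges can be made explicit with a small Python sketch. The Canopy values are the norm2 column from the raw data in this message; the Dirichlet values are the same column rounded to four decimal places for brevity. The helper name `spread` is hypothetical.

```python
def spread(values):
    """Return (lowest, highest, highest - lowest) for a list of numbers."""
    lowest, highest = min(values), max(values)
    return lowest, highest, highest - lowest

# norm2 column copied from the Canopy raw data in this message.
canopy_norm2 = [0.10244, 0.200042, 0.076551, 0.143568, 0.771413, 0.476613, 0.0]

# norm2 column from the Dirichlet raw data, rounded to four decimal places.
dirichlet_norm2 = [
    0.2372, 0.2183, 0.2347, 0.1994, 0.2332, 0.2787, 0.2288, 0.2603, 0.2468, 0.2833,
    0.2289, 0.2035, 0.2016, 0.2523, 0.2194, 0.2002, 0.2744, 0.2387, 0.1970, 0.2674,
]

for name, values in (("Canopy", canopy_norm2), ("Dirichlet", dirichlet_norm2)):
    lowest, highest, width = spread(values)
    print(f"{name}: norm2 from {lowest:.4f} to {highest:.4f} (width {width:.4f})")
```

The width of each range is only a rough summary of how similar the cluster sizes are; it says nothing about where the clusters sit in the space.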

Here are the raw data:

Canopies:
Stopped at 7 iterations. This is probably a function of my control 
values, but I don't understand them.

radius min: 0.127848, max: 0.82675, norm2: 0.10244, stddev: 0.97049228
radius min: 0.428303, max: 0.14688, norm2: 0.200042, stddev: 0.4691054
radius min: 0.953668, max: 0.037329, norm2: 0.076551, stddev: 0.1969004
radius min: 0.66706, max: 0.177616, norm2: 0.143568, stddev: 0.533347
radius min: 0.3834656, max: 0.093145, norm2: 0.771413, stddev: 0.2727437
radius min: 1.97559E-4, max: 0.26654, norm2: 0.476613, stddev: 0.883297
radius min: 2.72014E-308, max: 2.72014E-308, norm2: 0.0, stddev: 0.0



Dirichlet Clusters:
Allowed 50 iterations. Stopped at 24.
Length of cluster: 20
radius: min: 0.0, max: 0.4351731317613352, norm2: 0.23724015155870853, stddev: 0.08589605994209457
radius: min: 0.0, max: 0.43454768778264424, norm2: 0.2182967655938718, stddev: 0.07820922525108506
radius: min: 0.0, max: 0.4257544561005417, norm2: 0.2347278105757725, stddev: 0.08638504183578334
radius: min: 0.0, max: 0.3898861055534767, norm2: 0.19936157038662167, stddev: 0.07993022684323048
radius: min: 0.0, max: 0.4185431782190273, norm2: 0.23324067341509705, stddev: 0.08034437465659874
radius: min: 0.0, max: 0.48882417838386466, norm2: 0.2787118963689076, stddev: 0.08671269891849095
radius: min: 0.0, max: 0.4090508499677522, norm2: 0.22883939621598232, stddev: 0.07748583078136284
radius: min: 0.0, max: 0.4325558610552059, norm2: 0.2603226102820014, stddev: 0.07554530388950208
radius: min: 0.0, max: 0.39777198040477896, norm2: 0.24684110862692255, stddev: 0.08672454426749251
radius: min: 0.0, max: 0.531677760146581, norm2: 0.28330820569569837, stddev: 0.08448515053243884
radius: min: 0.0, max: 0.42377556269801, norm2: 0.22890581071124907, stddev: 0.08540251357878932
radius: min: 0.0, max: 0.4472174697924406, norm2: 0.20354417891408141, stddev: 0.08067777317911734
radius: min: 0.0, max: 0.3774646209477964, norm2: 0.2016034439565245, stddev: 0.08120738045804161
radius: min: 0.0, max: 0.41582209335459364, norm2: 0.25225877921586715, stddev: 0.08871816315297622
radius: min: 0.0, max: 0.4879159014228414, norm2: 0.21942117011373538, stddev: 0.08855141098098554
radius: min: 0.0, max: 0.4270525201114075, norm2: 0.20018637560733332, stddev: 0.08140121090799231
radius: min: 0.0, max: 0.4722323927707502, norm2: 0.27442792099816604, stddev: 0.08298142189530944
radius: min: 0.0, max: 0.37702578805324927, norm2: 0.23873257664491646, stddev: 0.0792203837674309
radius: min: 0.0, max: 0.3704620593571561, norm2: 0.19700023808010425, stddev: 0.08031515132089997
radius: min: 0.0, max: 0.46505290711258623, norm2: 0.26738097650066134, stddev: 0.07475626794172856