I have a dataset of vectors in 150 dimensions. I'm playing with
clustering. The vectors should be correlated in some way and so should
be somewhat clusterable. The numerical space is 0.0 <= x <= 1.0 in all
directions. The norm2 scaling factor for the space is 1/sqrt(dimensions).
KMeans/FuzzyKMeans did not work at all. Dirichlet works with an
AsymmetricSampledNormalDistribution. It stops after 24 iterations but
will give as many clusters as requested. (I don't know if this is expected.)
To evaluate these clusters, I am examining the radius of each cluster.
The radius is a vector with one distance per dimension of the cluster
vector. I normalize these to the 0 -> 1 space with the above norm2. I do
this to suit my own limited mathematical intuition.
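As a rough sketch of that normalization: the code below assumes that "normalize with the above norm2" means dividing the Euclidean norm of a radius vector by sqrt(dimensions), which is the largest norm any vector in the unit cube can have. The function name and the sample vector are hypothetical, not taken from Mahout.

```python
import math

def normalized_norm2(radius):
    """Euclidean norm of `radius`, scaled by 1/sqrt(d).

    Assumption: every component lies in [0, 1], so the all-ones
    vector (norm sqrt(d)) maps to exactly 1.0.
    """
    d = len(radius)
    norm2 = math.sqrt(sum(x * x for x in radius))
    return norm2 / math.sqrt(d)

# Hypothetical radius vector in 150 dimensions, every component 0.5.
print(normalized_norm2([0.5] * 150))
```

With this scaling, a radius whose components are all the same value v comes out at roughly v, which makes the number easy to read against the 0 -> 1 range.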
The results:
These radii, both in Canopy and Dirichlet, are all less than 1.0. Good
first step. Since KMeans doesn't work, that means the clusters are
probably asymmetric. The radii all have different norms. The 7 Canopy
radii have, in order, 5 roughly equal radii, one small and one near-zero,
showing how Canopy closes in. The Dirichlet output is a different kettle
of fish. First, all of the radii have several negative values. I had
assumed that the radius values would all be positive. I assume this is a
loose end in the Dirichlet implementation. I normalized them by subtracting
the lowest negative value from every component, and this is why all have a
minimum value of 0.0.
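The shift described here can be sketched as follows. This is only an illustration of moving every component so that the smallest one lands on 0.0; the function name and the sample values are made up.

```python
def shift_to_zero(radius):
    """Shift all components so the smallest one becomes 0.0.

    Vectors that are already non-negative are returned unchanged.
    """
    lowest = min(radius)
    if lowest >= 0:
        return list(radius)
    # Subtracting a negative number moves every component upward.
    return [x - lowest for x in radius]

# Hypothetical radius with one negative component.
print(shift_to_zero([-0.2, 0.1, 0.3]))
```

Note that a constant shift like this leaves the stddev of the components unchanged but does change max and norm2, so those two figures describe the shifted vector rather than the original one.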
Here are the Canopy and Dirichlet radius summaries. Min/Max/Norm come
from the Vector implementation functions. Stddev is from the
StandardDeviation class. Min/Max show the maximum skew of the radius
oval, and the norm2 is a measure of the N-dimensional size of the oval.
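For readers who want to reproduce these four numbers outside Mahout, here is a minimal sketch in plain Python. It assumes the sample standard deviation (dividing by n - 1), which may or may not match what the StandardDeviation class computes, and it reports the raw norm2 without the 1/sqrt(dimensions) scaling. The function name is hypothetical.

```python
import math
import statistics

def summarize_radius(radius):
    """Return min, max, Euclidean norm, and sample stddev of a radius vector."""
    return {
        "min": min(radius),
        "max": max(radius),
        "norm2": math.sqrt(sum(x * x for x in radius)),
        # Assumption: sample (n - 1) standard deviation of the components.
        "stddev": statistics.stdev(radius),
    }

# Hypothetical 2-dimensional radius, chosen so the norm is easy to check by hand.
print(summarize_radius([3.0, 4.0]))
```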
Interpretation of Canopy: the norm2 values of 0.07 to 0.7 indicate very
small to very large ovals. The stddev values indicate a similarly wide range
from rounded to extreme ovals.
Interpretation of Dirichlet: the norm2 values are from 0.18 to 0.30. The
stddev values are in a similarly narrow range. Thus, Dirichlet was much
better at finding good clusters.
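One way to put a number on "narrow" versus "wide" is to compare the spread (max minus min) of the norm2 values in each set. The sketch below uses the norm2 values from the raw data in this post, with the Dirichlet values rounded to four places. The spread is only a measure of how consistent the cluster sizes are; it is an assumption that this says something about cluster quality.

```python
# norm2 values copied from the raw data in this post
# (Dirichlet values rounded to 4 decimal places).
canopy_norm2 = [0.10244, 0.200042, 0.076551, 0.143568, 0.771413, 0.476613, 0.0]
dirichlet_norm2 = [
    0.2372, 0.2183, 0.2347, 0.1994, 0.2332, 0.2787, 0.2288, 0.2603, 0.2468, 0.2833,
    0.2289, 0.2035, 0.2016, 0.2523, 0.2194, 0.2002, 0.2744, 0.2387, 0.1970, 0.2674,
]

def spread(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)

print("Canopy spread:   ", round(spread(canopy_norm2), 4))
print("Dirichlet spread:", round(spread(dirichlet_norm2), 4))
```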
Here are the raw data:
Canopies:
Stopped at 7 iterations. This is probably a function of my control
values, but I don't understand them.
radius min: 0.127848, max: 0.82675, norm2: 0.10244, stddev: 0.97049228
radius min: 0.428303, max: 0.14688, norm2: 0.200042, stddev: 0.4691054
radius min: 0.953668, max: 0.037329, norm2: 0.076551, stddev: 0.1969004
radius min: 0.66706, max: 0.177616, norm2: 0.143568, stddev: 0.533347
radius min: 0.3834656, max: 0.093145, norm2: 0.771413, stddev: 0.2727437
radius min: 1.97559E-4, max: 0.26654, norm2: 0.476613, stddev: 0.883297
radius min: 2.72014E-308, max: 2.72014E-308, norm2: 0.0, stddev: 0.0
Dirichlet Clusters:
Allowed 50 iterations. Stopped at 24.
Length of cluster: 20
radius: min: 0.0, max: 0.4351731317613352, norm2: 0.23724015155870853, stddev: 0.08589605994209457
radius: min: 0.0, max: 0.43454768778264424, norm2: 0.2182967655938718, stddev: 0.07820922525108506
radius: min: 0.0, max: 0.4257544561005417, norm2: 0.2347278105757725, stddev: 0.08638504183578334
radius: min: 0.0, max: 0.3898861055534767, norm2: 0.19936157038662167, stddev: 0.07993022684323048
radius: min: 0.0, max: 0.4185431782190273, norm2: 0.23324067341509705, stddev: 0.08034437465659874
radius: min: 0.0, max: 0.48882417838386466, norm2: 0.2787118963689076, stddev: 0.08671269891849095
radius: min: 0.0, max: 0.4090508499677522, norm2: 0.22883939621598232, stddev: 0.07748583078136284
radius: min: 0.0, max: 0.4325558610552059, norm2: 0.2603226102820014, stddev: 0.07554530388950208
radius: min: 0.0, max: 0.39777198040477896, norm2: 0.24684110862692255, stddev: 0.08672454426749251
radius: min: 0.0, max: 0.531677760146581, norm2: 0.28330820569569837, stddev: 0.08448515053243884
radius: min: 0.0, max: 0.42377556269801, norm2: 0.22890581071124907, stddev: 0.08540251357878932
radius: min: 0.0, max: 0.4472174697924406, norm2: 0.20354417891408141, stddev: 0.08067777317911734
radius: min: 0.0, max: 0.3774646209477964, norm2: 0.2016034439565245, stddev: 0.08120738045804161
radius: min: 0.0, max: 0.41582209335459364, norm2: 0.25225877921586715, stddev: 0.08871816315297622
radius: min: 0.0, max: 0.4879159014228414, norm2: 0.21942117011373538, stddev: 0.08855141098098554
radius: min: 0.0, max: 0.4270525201114075, norm2: 0.20018637560733332, stddev: 0.08140121090799231
radius: min: 0.0, max: 0.4722323927707502, norm2: 0.27442792099816604, stddev: 0.08298142189530944
radius: min: 0.0, max: 0.37702578805324927, norm2: 0.23873257664491646, stddev: 0.0792203837674309
radius: min: 0.0, max: 0.3704620593571561, norm2: 0.19700023808010425, stddev: 0.08031515132089997
radius: min: 0.0, max: 0.46505290711258623, norm2: 0.26738097650066134, stddev: 0.07475626794172856