This has several things that make my spidey sense tingle.
On Sat, Nov 6, 2010 at 5:29 PM, Lance Norskog wrote:
> I have a dataset of vectors in 150 dimensions. I'm playing with clustering.
> The vectors should be correlated in some way and so should be somewhat
> clusterable. The numerical space is 0.0 <= x <= 1.0 in all directions. The
> norm2 for the space is 1/sqrt(dimensions).
>
What does "norm2 for the space" mean? Normally a norm is applied to a
vector and, by extension, to a matrix, not to a space.
> KMeans/FuzzyKMeans did not work at all.
That seems odd. 150 dimensions is more than this kind of clustering
handles well, but it seems like kmeans should have given some kind of
result. What did you observe?
> Dirichlet works with an AsymmetricSampledNormalDistribution. It stops after
> 24 iterations but will give as many clusters as requested. (I don't know if
> this is expected.)
>
Giving the number of clusters you specify is, I think, normal here.
> To evaluate these clusters, I am examining the radius of each cluster. The
> radius is a vector of distances for each dimension for the cluster vector. I
> normalize these to the 0 -> 1 space with the above norm2. I do this for my
> own limited mathematical intuitions.
>
This is a little unusual.
For k-means the normal things to look at are 0) the distribution of
distances between randomly distributed synthetic points, 1) the
distribution of distances between randomly selected data points, 2) the
distribution of distances between a point and a randomly selected
centroid and 3) the distribution of distances to the nearest centroid.
Looking at these for the training data and for held out data is ideal.
All of these distances should be computed without any normalization.
What you should look at includes:
- whether distribution 0 and distribution 1 are radically different.
Different is actually kind of good here because it means that your
points aren't just spread out all over the space.
- how different distributions 2 and 3 are, and how different distribution 3
is for training data and held-out data. 2 and 3 should be distinctly
different, and distribution 3 should be pretty similar for training and
held-out data.
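To make the four distributions concrete, here is a rough sketch in plain Python (3.8+ for math.dist). It is not Mahout code. The function names, the sample count, and the assumption that the synthetic points are uniform on the unit cube (taken from the 0.0 <= x <= 1.0 range mentioned above) are all mine. Distances are plain Euclidean with no normalization.

```python
import math
import random


def summarize(values):
    """Return a few summary statistics for a list of distances."""
    s = sorted(values)
    return {
        "min": s[0],
        "median": s[len(s) // 2],
        "max": s[-1],
        "mean": sum(s) / len(s),
    }


def distance_distributions(points, centroids, n_samples=1000, seed=0):
    """Summarize distance distributions 0 through 3 for one data set.

    points and centroids are lists of equal-length lists of floats.
    """
    rng = random.Random(seed)
    dim = len(points[0])

    def synthetic():
        # Assumption: synthetic points are uniform on [0, 1] in every dimension.
        return [rng.random() for _ in range(dim)]

    # 0) distances between randomly distributed synthetic points
    d0 = [math.dist(synthetic(), synthetic()) for _ in range(n_samples)]
    # 1) distances between randomly selected pairs of distinct data points
    d1 = [math.dist(*rng.sample(points, 2)) for _ in range(n_samples)]
    # 2) distances between a random point and a randomly selected centroid
    d2 = [
        math.dist(rng.choice(points), rng.choice(centroids))
        for _ in range(n_samples)
    ]
    # 3) distance from every point to its nearest centroid
    d3 = [min(math.dist(p, c) for c in centroids) for p in points]

    return {
        "synthetic_pairs": summarize(d0),
        "data_pairs": summarize(d1),
        "random_centroid": summarize(d2),
        "nearest_centroid": summarize(d3),
    }
```

Run it once on the training data and once on the held-out data with the same centroids, then compare the summaries side by side. Histograms would show more than these few statistics, but the statistics are enough for a first look.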
For any clustering at all, I like to compare the number of points that are
clustered into the different clusters for held out data versus
for the training data. The proportions should be about the same.
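The proportion check can be sketched the same way. This is plain Python (3.8+), not Mahout code, and it assumes hard assignment to the nearest centroid by Euclidean distance, which fits k-means. For a Dirichlet model you would assign by the model's own probabilities instead. All function names here are hypothetical.

```python
import math
from collections import Counter


def nearest_cluster(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def cluster_proportions(points, centroids):
    """Fraction of points assigned to each cluster, in centroid order."""
    counts = Counter(nearest_cluster(p, centroids) for p in points)
    return [counts[i] / len(points) for i in range(len(centroids))]


def max_proportion_gap(train, held_out, centroids):
    """Largest absolute difference in per-cluster proportion between the sets."""
    p_train = cluster_proportions(train, centroids)
    p_held = cluster_proportions(held_out, centroids)
    return max(abs(a - b) for a, b in zip(p_train, p_held))
```

A gap near zero means the held-out data fills the clusters in about the same proportions as the training data. How large a gap is worth worrying about depends on the sample sizes, so treat any threshold as a judgment call.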
The results:
> These radii, both in Canopy and Dirichlet, are all less than 1.0. Good
> first step. Since KMeans doesn't work, that means the clusters are probably
> asymmetric. The radii all have different norms. The 7 Canopy radii have,
> order, 5 roughly equal radii, one small and one near-zero, showing how
> Canopy closes in. The Dirichlet output is a different kettle of fish. First,
> all of the radii have several negative values. I had assumed that the radius
> values would all be positive. I assume this is a loose end in the Dirichlet
> implementation. I normalized them by adding the lowest negative value, and
> this is why all have a minimum value of 0.0.
>
I can't help with expectations for what these should look like. The
normalization makes them very hard to interpret. How did you compute
distance to a Dirichlet cluster?
> Here are the Canopy and Dirichlet radius summaries.
How did you come up with a single radius here?
>