# mahout-user mailing list archives

From Lance Norskog <goks...@gmail.com>
Subject Re: Clusters and radii interpretation
Date Mon, 08 Nov 2010 01:29:47 GMT
```
Snerk! Given that I don't know what I'm doing, that's not a surprise.

> How did you come up with a single radius here?
I made 7 vectors with Canopy, and fed them to k-means. The only outputs
from k-means that I can see are the center, centroid, and radius
vectors. It does not seem to have a list of the data vectors that each
cluster contains. But I have 7 radius vectors, not one.

Anyway, I've made progress. I went off to KNime and discovered A) a
k-means that will assign test data points to the k-means partition
created from the training set, and B) an MDS that lets me visualize
the 'clusterability'.

MDS is a dimension reduction algorithm for reducing multi-dimensional
vectors to 2 or 3 dimensions. KNime lets me visualize:
* the k-means centers for the training data
* the k-means partition from the training data applied to the test data
* the partition applied to random data

The test data and random data plots looked the same. The training data
created a beautiful 2D normal distribution plot, centered at the
center. The test and random data both created a more random plot,
recognizably a normal distribution, but centered off to the side. This
distributions.

This whole exercise has confirmed that my vector generation does a good job.

KNime is great for this.  www.knime.org - cannot recommend it highly
enough for data mining, especially for beginners.

On Sat, Nov 6, 2010 at 6:49 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> This has several things that make my spidey sense tingle.
>
> On Sat, Nov 6, 2010 at 5:29 PM, Lance Norskog <goksron@gmail.com> wrote:
>
>> I have a dataset of vectors in 150 dimensions. I'm playing with clustering.
>> The vectors should be correlated in some way and so should be somewhat
>> clusterable. The numerical space is 0.0 <= x <= 1.0 in all directions. The
>> norm2 for the space is 1/sqrt(dimensions).
>>
>
> What does "norm2 for the space" mean?  Normally a norm is applied to a
> vector and as a side effect to a matrix.
>
>
>> KMeans/FuzzyKMeans did not work at all.
>
>
> That seems odd and somewhat unusual.  150 dimensions is more than this
> kind of clustering works well with, but it seems like kmeans should have
> given some kind of result.  What did you observe?
>
>
>> Dirichlet works with an AsymmetricSampledNormalDistribution. It stops after
>> 24 iterations but will give as many clusters as requested. (I don't know if
>> this is expected.)
>>
>
> Giving the number of clusters you specify is, I think, normal here.
>
>
>> To evaluate these clusters, I am examining the radius of each cluster. The
>> radius is a vector of distances for each dimension for the cluster vector. I
>> normalize these to the 0 -> 1 space with the above norm2. I do this for my
>> own limited mathematical intuitions.
>>
>
> This is a little unusual.
>
> For k-means the normal things to look at are 0) the distribution of
> distances between randomly distributed synthetic points, 1) the
> distribution of distances between randomly selected data points, 2) the
> distribution of distances between a point and a randomly selected
> centroid and 3) the distribution of distances to the nearest centroid.
>  Looking at these for the training data and for held out data is ideal.
> All of these distances should be computed without any normalization.
>
> What you should look at includes:
>
> -  whether distribution 0 and distribution 1 are radically different.
>  Different is actually kind of good here because it means that your
> points aren't just spread out all over.
>
> - how different 2 and 3 are and how different distribution 3 is for training
> data and held out data.  2 and 3 should be distinctly different
> and distribution 3 should be pretty similar for held out data.
>
> For any clustering at all, I like to compare the number of points that are
> clustered into the different clusters for held out data versus
> for the training data.  The proportions should be about the same.
>
> The results:
>> These radii, both in Canopy and Dirichlet, are all less than 1.0. Good
>> first step. Since KMeans doesn't work, that means the clusters are probably
>> asymmetric. The radii all have different norms. The 7 Canopy radii have, in
>> order, 5 roughly equal radii, one small and one near-zero, showing how
>> Canopy closes in. The Dirichlet output is a different kettle of fish. First,
>> values would all be positive. I assume this is a loose end in the Dirichlet
>> implementation. I normalized them by adding the lowest negative value, and
>> this is why all have a minimum value of 0.0.
>>
>
> I can't help with expectations for what these should look like.  The
> normalization makes it very hard to understand.  How did you compute
> distance to a Dirichlet cluster?
>
>
>> Here are the Canopy and Dirichlet radius summaries.
>
>
> How did you come up with a single radius here?
>
>
>>
>

--
Lance Norskog
goksron@gmail.com

```
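The checks Ted lists in the quoted reply (distance distributions 0 through 3, plus per-cluster proportions for training versus held-out data) can be sketched roughly as follows. This is a minimal NumPy illustration, not Mahout or KNIME code. The function names and the `n_pairs` and `seed` parameters are hypothetical, and the synthetic points are assumed to be uniform on the 0.0 to 1.0 range described in the thread. Distances are plain Euclidean with no normalization, as the reply recommends.

```python
import numpy as np

def sampled_distances(a, b, n_pairs, rng):
    """Euclidean distances between n_pairs random rows of a and random rows of b."""
    i = rng.integers(0, len(a), size=n_pairs)
    j = rng.integers(0, len(b), size=n_pairs)
    return np.linalg.norm(a[i] - b[j], axis=1)

def nearest_centroid(points, centroids):
    """Return (index of nearest centroid, distance to it) for every point."""
    # Full (n_points, n_centroids) distance matrix; fine for modest data sizes.
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    idx = d.argmin(axis=1)
    return idx, d[np.arange(len(points)), idx]

def diagnostics(train, held_out, centroids, n_pairs=2000, seed=0):
    """Collect the distance samples and cluster proportions to compare."""
    rng = np.random.default_rng(seed)
    k = len(centroids)
    # Assumption: synthetic points drawn uniformly from the unit cube.
    synthetic = rng.uniform(0.0, 1.0, size=train.shape)
    out = {
        # 0) distances between random synthetic points
        "d0_synthetic_pairs": sampled_distances(synthetic, synthetic, n_pairs, rng),
        # 1) distances between randomly selected data points
        "d1_data_pairs": sampled_distances(train, train, n_pairs, rng),
        # 2) distances between a point and a randomly selected centroid
        "d2_point_to_random_centroid": sampled_distances(train, centroids, n_pairs, rng),
    }
    # 3) distances to the nearest centroid, for training and held-out data
    train_idx, out["d3_train_nearest"] = nearest_centroid(train, centroids)
    held_idx, out["d3_held_out_nearest"] = nearest_centroid(held_out, centroids)
    # Fraction of points assigned to each cluster; these should roughly agree.
    out["train_proportions"] = np.bincount(train_idx, minlength=k) / len(train)
    out["held_out_proportions"] = np.bincount(held_idx, minlength=k) / len(held_out)
    return out
```

Calling `diagnostics(train, held_out, centroids)` with `(n, 150)` arrays and a `(k, 150)` centroid array returns a dictionary of samples. Histograms or quantiles of `d0` against `d1`, and of `d2` against `d3`, are one way to make the comparisons the reply describes.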