mahout-user mailing list archives

From Lance Norskog <goks...@gmail.com>
Subject Re: Clusters and radii interpretation
Date Mon, 08 Nov 2010 01:29:47 GMT
Snerk! Given that I don't know what I'm doing, that's not a surprise.

> How did you come up with a single radius here?
I made 7 centroid vectors with Canopy and fed them to k-means as seeds.
The only outputs from k-means that I can see are the center, centroid,
and radius vectors. It does not seem to keep a list of the data vectors
each cluster contains. But I have 7 radius vectors, not one.
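
For anyone following along, here is how I understand the radius vector:
one entry per dimension, measuring the spread of a cluster's members
around its center along that axis. A minimal plain-Java sketch, assuming
the radius is the per-dimension standard deviation -- my reading, not
Mahout's actual code:

// Drop-in method: per-dimension radius of one cluster, taken as the
// standard deviation of the member points along each axis.
static double[] radius(java.util.List<double[]> members, double[] center) {
    int dims = center.length;
    double[] sumSq = new double[dims];
    for (double[] p : members) {
        for (int d = 0; d < dims; d++) {
            double diff = p[d] - center[d];
            sumSq[d] += diff * diff;
        }
    }
    double[] r = new double[dims];
    for (int d = 0; d < dims; d++) {
        r[d] = Math.sqrt(sumSq[d] / members.size());
    }
    return r;
}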

Anyway, I've made progress. I went off to KNIME and discovered A) a
k-means that will assign test data points to the k-means partition
created from the training set, and B) an MDS that lets me visualize
the 'clusterability'.
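
Point A is essentially nearest-centroid assignment. A minimal sketch of
the idea in plain Java, assuming Euclidean distance; the method name is
mine, and this is not KNIME's actual code:

// Drop-in method: assign one point to the nearest trained centroid.
static int nearestCentroid(double[] point, double[][] centroids) {
    int best = 0;
    double bestDist = Double.POSITIVE_INFINITY;
    for (int c = 0; c < centroids.length; c++) {
        double sum = 0;
        for (int d = 0; d < point.length; d++) {
            double diff = point[d] - centroids[c][d];
            sum += diff * diff;  // squared distance is enough for comparison
        }
        if (sum < bestDist) {
            bestDist = sum;
            best = c;
        }
    }
    return best;  // index of the winning cluster
}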

MDS (multidimensional scaling) is a dimension-reduction algorithm: it
maps multi-dimensional vectors down to 2 or 3 dimensions while
approximately preserving the pairwise distances. KNIME lets me visualize:
* the k-means centers for the training data
* the k-means partition from the training data applied to the test data
* the partition applied to random data
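
KNIME's MDS node does the work for me, but classical (Torgerson) MDS is
compact enough to sketch. This version uses Apache Commons Math for the
eigendecomposition; it is a generic sketch of the technique under my own
naming, not what KNIME runs internally:

import java.util.Arrays;
import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.EigenDecomposition;
import org.apache.commons.math3.linear.RealMatrix;

class Mds {
    // Classical MDS: double-center the squared distance matrix, then
    // embed points via the top eigenvectors scaled by sqrt(eigenvalue).
    static double[][] classicalMds(double[][] dist, int outDims) {
        final int n = dist.length;
        double[][] d2 = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                d2[i][j] = dist[i][j] * dist[i][j];
        // J = I - (1/n) * ones;  B = -1/2 * J * D2 * J
        double[][] jArr = new double[n][n];
        for (int a = 0; a < n; a++)
            for (int b = 0; b < n; b++)
                jArr[a][b] = (a == b ? 1.0 : 0.0) - 1.0 / n;
        RealMatrix J = new Array2DRowRealMatrix(jArr);
        RealMatrix B = J.multiply(new Array2DRowRealMatrix(d2))
                        .multiply(J).scalarMultiply(-0.5);
        final EigenDecomposition eig = new EigenDecomposition(B);
        // Sort eigenvalue indices in decreasing order; keep the top outDims.
        Integer[] idx = new Integer[n];
        for (int i = 0; i < n; i++) idx[i] = i;
        Arrays.sort(idx, (a, b) ->
            Double.compare(eig.getRealEigenvalue(b), eig.getRealEigenvalue(a)));
        double[][] coords = new double[n][outDims];
        for (int k = 0; k < outDims; k++) {
            double lambda = eig.getRealEigenvalue(idx[k]);
            double scale = Math.sqrt(Math.max(lambda, 0.0)); // guard tiny negatives
            double[] v = eig.getEigenvector(idx[k]).toArray();
            for (int i = 0; i < n; i++) coords[i][k] = v[i] * scale;
        }
        return coords;
    }
}

Each row of the returned coords is then a 2-D or 3-D point you can
scatter-plot, colored by cluster assignment.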

The test data and random data plots looked the same. The training data
produced a beautiful 2D normal-distribution plot, centered on the cluster
center. The test and random data both produced a noisier plot, still
recognizably a normal distribution, but centered off to the side. This
matches your advice that test and training data should show different
distributions.

This whole exercise has confirmed that my vector generation does a good job.

KNIME is great for this (www.knime.org). I cannot recommend it highly
enough for data mining, especially for beginners.

On Sat, Nov 6, 2010 at 6:49 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> This has several things that make my spidey sense tingle.
>
> On Sat, Nov 6, 2010 at 5:29 PM, Lance Norskog <goksron@gmail.com> wrote:
>
>> I have a dataset of vectors in 150 dimensions. I'm playing with clustering.
>> The vectors should be correlated in some way and so should be somewhat
>> clusterable. The numerical space is 0.0 <= x <= 1.0 in all directions. The
>> norm2 for the space is 1/sqrt(dimensions).
>>
>
> What does "norm2 for the space" mean?  Normally a norm is applied to a
> vector and as a side effect to a matrix.
>
>
>> KMeans/FuzzyKMeans did not work at all.
>
>
> That seems odd and somewhat unusual.  150 dimensions is more than this
> kind of clustering usually handles well, but it seems like k-means should
> have given some kind of result.  What did you observe?
>
>
>> Dirichlet works with an AsymmetricSampledNormalDistribution. It stops after
>> 24 iterations but will give as many clusters as requested. (I don't know if
>> this is expected.)
>>
>
> Giving the number of clusters you specify is, I think, normal here.
>
>
>> To evaluate these clusters, I am examining the radius of each cluster. The
>> radius is a vector with one distance per dimension for the cluster. I
>> normalize these to the 0 -> 1 space with the above norm2. I do this to help
>> my own limited mathematical intuition.
>>
>
> This is a little unusual.
>
> For k-means the normal things to look at are 0) the distribution of
> distances between randomly distributed synthetic points, 1) the
> distribution of distances between randomly selected data points, 2) the
> distribution of distances between a point and a randomly selected
> centroid and 3) the distribution of distances to the nearest centroid.
>  Looking at these for the training data and for held out data is ideal.
> All of these distances should be computed without any normalization.
>
> What you should look at includes:
>
> -  whether distribution 0 and distribution 1 are radically different.
>  Different is actually kind of good here because it means that your
> points aren't just spread out all over
>
> - how different 2 and 3 are and how different distribution 3 is for training
> data and held out data.  2 and 3 should be distinctly different
> and distribution 3 should be pretty similar for held out data.
>
> For any clustering at all, I like to compare the number of points that are
> clustered into the different clusters for held out data versus
> for the training data.  The proportions should be about the same.
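
To spell out how I read those four distributions, here is a literal
plain-Java sketch -- Euclidean distance and all the names are my own
assumptions, not anything from Mahout. Each sampleN() gets called many
times and the results histogrammed:

import java.util.List;
import java.util.Random;

class DistanceDiagnostics {
    static final Random RAND = new Random();

    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // 0) distance between two random synthetic points in the unit cube
    static double sample0(int dims) {
        double[] a = new double[dims], b = new double[dims];
        for (int d = 0; d < dims; d++) {
            a[d] = RAND.nextDouble();
            b[d] = RAND.nextDouble();
        }
        return distance(a, b);
    }

    // 1) distance between two randomly selected data points
    static double sample1(List<double[]> data) {
        return distance(data.get(RAND.nextInt(data.size())),
                        data.get(RAND.nextInt(data.size())));
    }

    // 2) distance from a random point to a randomly selected centroid
    static double sample2(List<double[]> data, List<double[]> centroids) {
        return distance(data.get(RAND.nextInt(data.size())),
                        centroids.get(RAND.nextInt(centroids.size())));
    }

    // 3) distance from a random point to its nearest centroid
    static double sample3(List<double[]> data, List<double[]> centroids) {
        double[] p = data.get(RAND.nextInt(data.size()));
        double best = Double.POSITIVE_INFINITY;
        for (double[] c : centroids) {
            best = Math.min(best, distance(p, c));
        }
        return best;
    }
}

Running sample3() over both the training set and the held-out set,
against the same centroids, gives the training-vs-held-out comparison
directly.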
>
> The results:
>> These radii, both in Canopy and Dirichlet, are all less than 1.0. Good
>> first step. Since KMeans doesn't work, that means the clusters are probably
>> asymmetric. The radii all have different norms. The 7 Canopy radii have, in
>> order, 5 roughly equal radii, one small, and one near-zero, showing how
>> Canopy closes in. The Dirichlet output is a different kettle of fish. First,
>> all of the radii have several negative values. I had assumed that the radius
>> values would all be positive. I assume this is a loose end in the Dirichlet
>> implementation. I normalized each one by shifting it up by its most negative
>> value (e.g., a radius of (-0.2, 0.1, 0.5) becomes (0.0, 0.3, 0.7)), which is
>> why they all have a minimum value of 0.0.
>>
>
> I can't help with expectations for what these should look like.  The
> normalization makes it very hard to understand.  How did you compute
> distance to a Dirichlet cluster?
>
>
>> Here are the Canopy and Dirichlet radius summaries.
>
>
> How did you come up with a single radius here?
>
>



-- 
Lance Norskog
goksron@gmail.com
