Here is a link to the image. Nothing very exciting.
https://dl.dropboxusercontent.com/u/36863361/iris.png
On Sat, Dec 28, 2013 at 4:32 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>
> On Sat, Dec 28, 2013 at 1:10 PM, Johannes Schulte <
> johannes.schulte@gmail.com> wrote:
>
>> Okay, understood. For a lot of clusters (which I don't necessarily
>> attribute to big-data problems, but definitely to nearest-neighbour-like
>> usage of clusters), the "every cluster sees every point" approach doesn't
>> scale well.
>>
>
>
> Nearest neighbor sorts of problems. Or market segmentation kinds of
> problems. Or feature generation kinds of problems. Or volume quantization
> kinds of problems.
>
> The point is that with more data, you can derive a more detailed model.
> That means more clusters.
>
> The ideal case is that we have an infinite mixture model to describe the
> data distribution, but we haven't seen most of the mixture components yet.
> As we get more data, we can justify saying we have more components.
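>
> As a toy illustration of that growth (my own aside, not specific to any
> particular algorithm): under a Dirichlet process mixture with concentration
> alpha, the expected number of distinct components after n points grows
> roughly like alpha * log(1 + n/alpha), so each order of magnitude of data
> buys a roughly constant number of new clusters:
>
> > alpha = 5
> > n = 10^(2:6)
> > round(alpha * log(1 + n/alpha))  # expected component count vs sample size
> [1] 15 27 38 50 61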
>
> Even if you think that there are *really* 10 clusters in the data, I can
> make the case that you are going to get a better unsupervised description
> of those clusters by using 100 or more clusters in the kmeans algorithm.
> The idea is that each of your real clusters will be composed of several of
> the kmeans clusters and finding the attachments is much easier because 100
> clusters is much smaller than the original number of samples.
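>
> A toy simulation of that attachment step (entirely my construction): draw
> from 10 well-separated Gaussians, fit 100 kmeans clusters, and attach each
> fine cluster to the real cluster that dominates it:
>
> > set.seed(42)
> > centers10 = matrix(rnorm(20, sd = 10), 10, 2)  # 10 "real" cluster centers
> > x = centers10[rep(1:10, each = 1000), ] + rnorm(20000)
> > truth = rep(1:10, each = 1000)
> > k = kmeans(x, centers = 100, nstart = 5, iter.max = 50)
> > # majority vote over only 100 clusters, not 10000 points
> > attachment = apply(table(truth, k$cluster), 2, which.max)
> > mean(attachment[k$cluster] == truth)  # fraction recovered, typically ~1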
>
> As a small-data example, suppose you are looking at Anderson's Iris data,
> which comes built into R. If we plot the 150 data points against the four
> features, pairwise, we can see various patterns and see that a nonlinear
> classifier should do quite well in separating the classes (I hope the image
> makes it through the emailer):
>
> [image: Inline image 1]
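>
> In case the image does not survive, it is just the standard pairs plot of
> the four features colored by species; one way to reproduce something like
> it:
>
> > pairs(iris[,1:4], col = as.integer(iris$Species))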
>
> But if we try to do kmeans on these data with only 3 clusters, we get
> very poor assignment to the three species:
>
>
> > k = kmeans(iris[,1:4], centers=3, nstart=10)
> > table(iris$Species, k$cluster)
>
>               cluster
>               1  2  3
>   setosa     50  0  0
>   versicolor  0 48  2
>   virginica   0 14 36
>
> Cluster 1 captures the isolated setosa species perfectly, but versicolor
> and virginica are not well separated: cluster 2 is roughly an 80/20 mix,
> with 48 versicolor and 14 virginica points.
>
> On the other hand, if we use 7 clusters,
>
>
> > k = kmeans(iris[,1:4], centers=7, nstart=10)
> > table(iris$Species, k$cluster)
>
>               cluster
>               1  2  3  4  5  6  7
>   setosa      0  0 28  0 22  0  0
>   versicolor  0  7  0 20  0  0 23
>   virginica  12  0  0  1  0 24 13
>
> Most clusters are now composed of essentially one species. Cluster 4 is
> 95% versicolor (20 of its 21 points), and only cluster 7, which mixes 23
> versicolor with 13 virginica, retains any substantial impurity.
>
> What this means is that we can use the 7-cluster kmeans result to build
> a classifier that takes a 1-of-7 input feature (the cluster id) instead of
> 4 real values. That is, we have compressed the 4 original continuous
> features down to about 2.7 bits on average, and this compressed
> representation actually makes building a classifier nearly trivial.
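>
> To spell that out, a minimal sketch of such a classifier (my construction):
> predict, for each point, the majority species of its cluster. With the
> 7-cluster table above, this gets 136 of the 150 training points right:
>
> > majority = apply(table(iris$Species, k$cluster), 2, which.max)
> > predicted = levels(iris$Species)[majority[k$cluster]]
> > mean(predicted == iris$Species)  # training accuracy of the cluster-id vote
>
> The 2.7-bit figure itself is just the entropy of the cluster sizes: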
>
> > -sum(table(k$cluster) * log(table(k$cluster)/150)/150)/log(2)
> [1] 2.670288
>
> This is pretty cool compression given that the original continuous data
> had about 22 bits of raw information capacity based on the range and
> precision given.
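>
> That 22-bit figure can be reproduced with a back-of-envelope calculation
> (my reconstruction, assuming 0.1 cm measurement precision and rounding each
> feature up to whole bits):
>
> > # distinct representable values per feature, rounded up to whole bits
> > sum(ceiling(log2(sapply(iris[,1:4], function(x) diff(range(x))/0.1 + 1))))
> [1] 22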
>
> Now, in the real world, we would need to hold out some data for cross
> validation of the clustering. But the fact remains that if you want to use
> the kmeans clustering for some machine-oriented purpose, it pays to have as
> many clusters as you can, subject to the held-out data agreeing with the
> original data on the per-cluster distance distributions and counts.
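>
> One possible version of that held-out check (the details are my sketch, not
> a recipe from this thread): cluster a training split, assign the held-out
> points to their nearest centroids, and compare the per-cluster proportions:
>
> > set.seed(1)
> > train = sample(150, 100)
> > k = kmeans(iris[train, 1:4], centers = 7, nstart = 10)
> > # distances from the 7 training centroids to the 50 held-out points
> > d = as.matrix(dist(rbind(k$centers, iris[-train, 1:4])))[1:7, -(1:7)]
> > nearest = apply(d, 2, which.min)
> > # the two rows should roughly agree if the clustering generalizes
> > rbind(train = k$size / 100, heldout = tabulate(nearest, 7) / 50)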
>
