mahout-dev mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Streaming KMeans clustering
Date Sun, 29 Dec 2013 22:40:22 GMT
Here is a link to the image.  Nothing very exciting.

https://dl.dropboxusercontent.com/u/36863361/iris.png




On Sat, Dec 28, 2013 at 4:32 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

>
> On Sat, Dec 28, 2013 at 1:10 PM, Johannes Schulte <
> johannes.schulte@gmail.com> wrote:
>
>> Okay, understood. For a lot of clusters (which I don't necessarily
>> attribute to big-data problems, but definitely to nearest-neighbour-like
>> usage of clusters), the "every cluster sees every point" approach doesn't
>> scale well.
>>
>
>
> Nearest neighbor sorts of problems.  Or market segmentation kinds of
> problems.  Or feature generation kinds of problems.  Or volume quantization
> kinds of problems.
>
> The point is that with more data, you can derive a more detailed model.
>  That means more clusters.
>
> The ideal case is that we have an infinite mixture model to describe the
> data distribution, but we haven't seen most of the mixture components yet.
>  As we get more data, we can justify saying we have more components.
>
> Even if you think that there are *really* 10 clusters in the data, I can
> make the case that you are going to get a better unsupervised description
> of those clusters by using 100 or more clusters in the k-means algorithm.
>  The idea is that each of your real clusters will be composed of several of
> the k-means clusters, and finding which k-means cluster attaches to which
> real cluster is much easier because 100 clusters is a much smaller problem
> than the original number of samples.
>
> As a small data example, suppose you are looking at Anderson's Iris data,
> which comes built into R.  If we plot the 150 data points for each pair of
> features, we can see various patterns and see that a non-linear classifier
> should do quite well in separating the classes (I hope the image makes it
> through the emailer):
>
> [image: pairs plot of the four iris features; see the iris.png link at the
> top of this message]
>
> But if we try to do k-means on these data with only 3 clusters, we get
> very poor assignment to the three species:
>
>
> > k = kmeans(iris[,1:4], centers=3, nstart=10)
> > table(iris$Species, k$cluster)
>
>              cluster
>               1  2  3
>   setosa     50  0  0
>   versicolor  0 48  2
>   virginica   0 14 36
>
> Cluster 1 captured the isolated setosa species perfectly, but versicolor
> and virginica are not well separated: cluster 2 is roughly 80% versicolor
> and 20% virginica (48 versus 14 samples).
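>
> For illustration, a hypothetical one-liner reads that composition off the
> fit directly:
>
> > # row-normalize the cluster-by-species counts into per-cluster fractions
> > round(prop.table(table(k$cluster, iris$Species), margin = 1), 2)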
>
> On the other hand, if we use 7 clusters,
>
>
> > k = kmeans(iris[,1:4], centers=7, nstart=10)
> > table(iris$Species, k$cluster)
>
>              cluster
>               1  2  3  4  5  6  7
>   setosa      0  0 28  0 22  0  0
>   versicolor  0  7  0 20  0  0 23
>   virginica  12  0  0  1  0 24 13
>
> Most clusters now contain exactly one species.  Cluster 4 is 95% composed
> of versicolor samples (20 of 21), and only cluster 7 still mixes versicolor
> and virginica to any real degree.
>
> What this means is that we can use the 7-cluster k-means result to build
> a classifier that takes a 1-of-7 input feature (the cluster id) instead of
> four real values.  That is, we have compressed the four original continuous
> features down to about 2.7 bits on average, and this compressed
> representation actually makes building a classifier nearly trivial.
>
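> To make "nearly trivial" concrete, here is a hypothetical sketch of such a
> classifier (the names counts, label, and predicted are illustrative): label
> each cluster with its majority species, then classify any point by looking
> up its cluster's label.
>
> > # majority species per cluster becomes that cluster's predicted label
> > counts = table(k$cluster, iris$Species)
> > label = colnames(counts)[apply(counts, 1, which.max)]
> > predicted = label[k$cluster]
> > mean(predicted == iris$Species)  # 136/150, about 0.91, for the table above
>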
> > # Shannon entropy (in bits) of the cluster-assignment distribution
> > -sum(table(k$cluster) * log(table(k$cluster)/150)/150)/log(2)
> [1] 2.670288
>
> This is pretty cool compression given that the original continuous data
> had about 22 bits of raw information capacity based on the range and
> precision given.
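>
> As a back-of-envelope check, assuming the measurements are recorded in
> 0.1 cm steps: count the distinct steps spanned by each feature and add up
> the bits.  This lands near 20 bits, the same ballpark.
>
> > ranges = apply(iris[,1:4], 2, function(x) max(x) - min(x))
> > sum(log2(ranges/0.1 + 1))   # about 20.4 bits across the four features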
>
> Now, in the real world, we would need to hold out some data for cross
> validation of the clustering, but the fact remains that if you want to use
> k-means clustering for some machine-oriented purpose, it pays to use as
> many clusters as you can, subject to the held-out data agreeing with the
> original data on the distance distributions and counts for the clusters.
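>
> That held-out check might look something like this (a hypothetical sketch;
> the split, the name km, and the helper nearest are all illustrative):
>
> > set.seed(1)
> > train = sample(150, 100)                 # hypothetical 2/3 split
> > km = kmeans(iris[train, 1:4], centers=7, nstart=10)
> >
> > # distance from each held-out point to its nearest training centroid
> > nearest = function(x) {
> +   d = sqrt(colSums((t(km$centers) - x)^2))
> +   c(cluster = which.min(d), dist = min(d))
> + }
> > held = t(apply(iris[-train, 1:4], 1, nearest))
> >
> > # held-out counts per cluster should roughly track the training counts,
> > # and held-out distances should look like the training distance spread
> > rbind(train = tabulate(km$cluster, 7), held = tabulate(held[,"cluster"], 7))
> > summary(held[,"dist"])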
>
