mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Streaming kmeans question
Date Mon, 28 Jul 2014 19:45:34 GMT

I am traveling and it is difficult to get a real internet connection. 


Here is an answer to one of your questions.

For very high-dimensional data, some kind of dimensionality reduction is usually important. The streaming
k-means code does this by using a random projection to approximate the nearest-centroid search.
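
To make the random projection idea concrete, here is a minimal Python sketch. It is an illustration under my own assumptions, not Mahout's actual code: the class name ProjectionIndex and the parameters num_projections and window are hypothetical. Each centroid is projected onto a few random directions, the centroids whose projections land closest to the query's projection become candidates, and the full-dimensional distance is computed only for those candidates.

```python
import bisect
import math
import random


def random_unit_vector(dim, rng):
    """Draw a random direction by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


class ProjectionIndex:
    """Hypothetical helper: approximate nearest-centroid search via 1-D random projections."""

    def __init__(self, centroids, num_projections=3, window=4, seed=0):
        rng = random.Random(seed)
        dim = len(centroids[0])
        self.centroids = centroids
        self.window = window  # neighbours inspected on each side, per projection
        self.directions = [random_unit_vector(dim, rng) for _ in range(num_projections)]
        # For each direction, keep (projected value, centroid index) sorted by value.
        self.sorted_projections = [
            sorted((dot(d, c), i) for i, c in enumerate(centroids))
            for d in self.directions
        ]

    def nearest(self, point):
        """Return the index of the (approximately) nearest centroid."""
        candidates = set()
        for d, proj in zip(self.directions, self.sorted_projections):
            # Binary search for where the query's projection would be inserted.
            pos = bisect.bisect_left(proj, (dot(d, point), -1))
            lo = max(0, pos - self.window)
            hi = min(len(proj), pos + self.window)
            candidates.update(i for _, i in proj[lo:hi])
        # Pay for the full-dimensional distance only on the candidates.
        return min(candidates, key=lambda i: squared_distance(point, self.centroids[i]))
```

The answer is approximate because a true nearest neighbour can fall outside the window on every projection. More projections or a wider window trade speed for accuracy.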


Note that the output of the streaming step is *not* a set of initial centroids. Instead it
is a large number of centroids that are clustered as a surrogate for the original data.
These centroids are far fewer than the original data points, so the final ball k-means can
run in memory. This is very different from the canopy approach.
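
A rough Python sketch of this two-stage idea follows. It is a simplified illustration, not the Mahout implementation: the function names, the cutoff growth factor of 1.5, and the rule for opening a new centroid are my assumptions, and plain weighted Lloyd iterations stand in for ball k-means. The point is only that the streaming pass yields a small weighted sketch, and the final clustering runs on that sketch in memory rather than on the original data.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _absorb(sketch, vec, weight, cutoff, rng):
    """Merge vec into its nearest sketch centroid, or open a new centroid."""
    if sketch:
        nearest = min(sketch, key=lambda cw: squared_distance(vec, cw[0]))
        d2 = squared_distance(vec, nearest[0])
        # Assumed rule: the farther the point, the likelier it opens a new centroid.
        if rng.random() >= d2 / cutoff:
            c, w = nearest
            total = w + weight
            nearest[0] = [(ci * w + vi * weight) / total for ci, vi in zip(c, vec)]
            nearest[1] = total
            return
    sketch.append([list(vec), weight])


def streaming_sketch(points, max_centroids, cutoff=1.0, seed=0):
    """One pass over the data; returns [centroid, weight] pairs (max_centroids >= 1)."""
    rng = random.Random(seed)
    sketch = []
    for p in points:
        _absorb(sketch, p, 1.0, cutoff, rng)
        while len(sketch) > max_centroids:
            # Sketch too large: raise the cutoff and re-cluster the sketch itself.
            cutoff *= 1.5
            old, sketch = sketch, []
            for vec, weight in old:
                _absorb(sketch, vec, weight, cutoff, rng)
    return sketch


def weighted_kmeans(sketch, k, iterations=10, seed=0):
    """Stand-in for the final in-memory step: weighted Lloyd iterations on the sketch."""
    rng = random.Random(seed)
    centers = [list(c) for c, _ in rng.sample(sketch, k)]
    for _ in range(iterations):
        sums = [[0.0] * len(centers[0]) for _ in range(k)]
        totals = [0.0] * k
        for vec, weight in sketch:
            j = min(range(k), key=lambda i: squared_distance(vec, centers[i]))
            totals[j] += weight
            for d, v in enumerate(vec):
                sums[j][d] += weight * v
        for j in range(k):
            if totals[j] > 0:  # leave a center in place if nothing was assigned to it
                centers[j] = [s / totals[j] for s in sums[j]]
    return centers
```

The weights matter: each sketch centroid stands for all the original points folded into it, so the final step has to treat it as that many points.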

There is a known issue with the map-reduce version of the streaming k-means program that causes
the number of centroids output by the parallel part of the algorithm to be too large. 


Sent from my iPhone

> On Jul 28, 2014, at 3:08, Bojan Kostić <blood9raven@gmail.com> wrote:
> 
> Also, as I see it, streaming k-means is for large sets of data. Does "large"
> mean a large number of points and not dimensions? And what to do when the data
> has many dimensions? Like more than 1000000 dimensions.
