Maxim Arap edited comment on MAHOUT-1468 at 4/23/14 5:00 PM:

Andrew: The default initial value for numClusters is 20, which seems arbitrary. As the algorithm
runs, numClusters grows to roughly k log n, where k is the final number of clusters (the number
the BallKMeans step will output) and n is the size of the dataset. In practice, k log n can be
much larger than 20, depending on the dataset and on k.
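To give a feel for how far k log n can drift from the default of 20, here is a rough sketch. The function name, the use of the natural logarithm, and the sample values of k and n are my assumptions for illustration only; this is not Mahout code and the numbers do not come from a real run.

```python
import math


def estimated_num_clusters(k, n):
    """Rough k * log(n) estimate of the intermediate cluster count.

    Assumptions: natural logarithm and rounding up. The constant
    factor hidden in "roughly k log n" is ignored.
    """
    return math.ceil(k * math.log(n))


if __name__ == "__main__":
    default_num_clusters = 20  # the default initial value discussed above
    # Hypothetical (k, n) pairs, chosen only to show the growth.
    for k, n in [(10, 20_000), (20, 1_000_000), (100, 10_000_000)]:
        estimate = estimated_num_clusters(k, n)
        print(f"k={k}, n={n}: about {estimate} clusters "
              f"(default is {default_num_clusters})")
```

Even for modest k the estimate lands well above 20, so it may be worth documenting that users should set numClusters from their own k and n.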
Suneel: I tried running the algorithm in both sequential mode and MapReduce mode on the
Reuters data last night, but both gave me runtime errors. The reason may be that my laptop has
Hadoop 2.2.0, which may not be compatible with Mahout at this point.
> Creating a new page for StreamingKMeans documentation on mahout website
> 
>
> Key: MAHOUT-1468
> URL: https://issues.apache.org/jira/browse/MAHOUT1468
> Project: Mahout
> Issue Type: Documentation
> Components: Documentation
> Affects Versions: 1.0
> Reporter: Pavan Kumar N
> Assignee: Andrew Musselman
> Labels: Documentation
> Fix For: 1.0
>
> Attachments: StreamingKMeans.txt
>
>
> Separate page required for the Streaming K Means algorithm, with a description and overview, an
explanation of the various parameters that can be used in streamingkmeans, the strategy for
parallelization, and a link to this paper: http://papers.nips.cc/paper/3812streamingkmeansapproximation.pdf

