[ https://issues.apache.org/jira/browse/MAHOUT1218?page=com.atlassian.jira.plugin.system.issuetabpanels:commenttabpanel&focusedCommentId=13660434#comment13660434
]
Dan Filimon commented on MAHOUT1218:

This is still an issue I think.
If k is the number of clusters, n the number of points and m is the number of mappers, there
should be (k log n) clusters after the map phase.
BUT, the meaning of the km parameter is the number of clusters to generate from ONE mapper.
So, if it's set to the recommended (k log n) for each mapper we'll end up with m (k log n)
clusters which is way more than we need. This causes the reducer to fail because it can't
fit all the points in memory.
For a given mapper, assuming the splits are equal, if there are (n / m) points to cluster,
the number of clusters it should create is (k log (n / m)).
For the concrete case we were playing with, n = 4*10^6, k = 20, m = 114 (and the log is actually
ln).
 So, if we follow the recommendation, and have km = k log n for every mapper, that's 305
clusters per mapper. So, in total the reducer has to cluster about 34660 points.
 If we go for, km = k log (n/m) for every mapper, that's 204 clusters per mapper. Which gets
us about 24K points to cluster at the end.
 If we go even further, and suggest km = (k log n)/m for every mapper, that's about 2 clusters
per mapper. That's probably too little to capture the clustering of those points, but chances
are it'll still do a good job if km = 100.
And in any case, for any value of km, it is just an initial estimate, so we don't care about
what _exactly_ the value is.
As we're streaming points through the mapper, the number of clusters gets increased to k *
log (numProcessedDatapoints).
It might even make sense to have fewer than that many clusters at the beginning to ensure
there aren't too many collapses.
So, I think it's safe to say that the precondition in the driver is a bit excessive. Maybe
pop up a warning, but I don't think it should prevent the job from running. It's not like
a small km is violating any assumptions.
Somewhat offtopic:
Looking at the paper again, what is the base of the logarithm exactly? It makes a difference
in these calculations, and in the code we're using log base e. I would have expected the base
to be 2, but honestly, since it's just O(k log n), the actual base doesn't matter in the asymptotic
notation.
Still, what is the right value for the code?
> Streamimg kmeans fails when the number of clusters specified is <= estimated map
clusters
> 
>
> Key: MAHOUT1218
> URL: https://issues.apache.org/jira/browse/MAHOUT1218
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.8
> Reporter: Suneel Marthi
> Assignee: Suneel Marthi
> Fix For: 0.8
>
>
> Running Streaming kmeans with CosineDistanceMeasure, Fast Projection Search, number
of clusters k= 60, number of estimated map clsuters km = 60.
> {Code}
> Exception in thread "main" java.lang.IllegalArgumentException: Invalid number of estimated
map clusters; There must be more than the final number of clusters (k log n vs k)
> at com.google.common.base.Preconditions.checkArgument(Preconditions.java:92)
> at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.configureOptionsForWorkers(StreamingKMeansDriver.java:327)
> at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.configureOptionsForWorkers(StreamingKMeansDriver.java:280)
> at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:227)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
> at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.main(StreamingKMeansDriver.java:472)
> {Code}

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
