I think (casually, informally) that the guarantees are preserved by simple
ordering arguments. The argument goes that the threshold on each partition
will grow less than it would if the partitions were handled sequentially.
Thus the partial sketches should have at least as much fidelity as if the
same segment were processed after other segments. Secondly, the total number of sketch
clusters \kappa in M sketches across N points is
\kappa > k log N for the sequential case
and
\kappa > M ( k log(N/M) ) = M k log N - M k log M
But M << N, so log N > log M. If M >= 2 and M < sqrt(N), then
\kappa > M ( k log(N/M) ) = M k log N - M k log M >= (M/2) k log N >= k log N
If M = 1, the bound reduces to the sequential case with equality (whew).
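A quick numeric sanity check of that chain of inequalities (the k, N, and M
values here are just illustrative, not from any real run):

```python
import math

# Sequential: kappa > k log N. Distributed over M mappers:
# kappa > M * k * log(N / M) = M k log N - M k log M.
k = 10
N = 1_000_000

sequential_bound = k * math.log(N)

# For 2 <= M < sqrt(N), the distributed bound dominates the sequential one:
# M k log(N/M) >= (M/2) k log N >= k log N.
for M in (2, 10, 100, 999):
    distributed_bound = M * k * math.log(N / M)
    assert distributed_bound >= (M / 2) * k * math.log(N) >= sequential_bound

# At M = 1 the two bounds coincide.
assert 1 * k * math.log(N / 1) == sequential_bound
```

Note that M = 999 is just under sqrt(1e6) = 1000, so the middle inequality
holds only barely there, which is exactly what the M < sqrt(N) constraint
predicts.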
This is a pretty reasonable constraint (e.g., use fewer than 1,000 mappers on
1e6 points and fewer than 10,000 mappers on 1e9 points).
Thus the map-reduce sketch is typically more detailed in both senses
(smaller threshold, more sketch centroids) for common cases. This should
yield theoretical results that are as good or better.
On Thu, Mar 28, 2013 at 10:30 PM, Andy Twigg <andy.twigg@gmail.com> wrote:
> Dan/Ted:
>
> I like that you are implementing streaming k-means.
>
> Are there any results comparing it to mini-batch k-means ([1] and the
> paper cited therein)?
>
> In the distributed implementation, you independently compute an
> O(k)-means clustering on each partition, then combine them into a
> final k-means. Are there any guarantees/results about the accuracy of
> this? Clearly this sort of design also favours a storm/spark
> implementation - have you considered that?
>
> Andy
>
>
>
> [1]
> http://scikit-learn.org/dev/modules/generated/sklearn.cluster.MiniBatchKMeans.html
>
