mahout-dev mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: streaming k-means vs minibatch k-means
Date Fri, 29 Mar 2013 02:29:53 GMT
I think (casually, informally) that the guarantees are preserved by simple
ordering arguments.  The argument goes that the threshold on each partition
grows more slowly than it would if the partitions were handled sequentially,
so each partial sketch should have at least as much fidelity as it would if
the same segment were processed after the other segments.  Second, the total
number of sketch clusters \kappa in M sketches across N points satisfies

   \kappa > k log N for the sequential case

and

    \kappa > M k log(N/M) = M k log N - M k log M for M partitions of N/M points each

But M << N, so log M < log N.  Moreover, if M >= 2 and M < sqrt(N), then
log M < (1/2) log N, and

    \kappa > M k log(N/M) = M k log N - M k log M > (M/2) k log N >= k log N

If M = 1, the map-reduce bound reduces exactly to the sequential bound (whew).

This is a pretty reasonable constraint (i.e., use fewer than 1,000 mappers
on 1e6 points and fewer than 10,000 mappers on 1e9 points).
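
As a quick sanity check of those figures, here is a throwaway Python
snippet; k = 100 and natural logs are arbitrary choices:

import math

# Compare the sequential bound k log N with the map-reduce bound
# M k log(N/M) at the mapper counts suggested above.
k = 100
for N, M in [(1e6, 1_000), (1e9, 10_000)]:
    seq = k * math.log(N)
    mr = M * k * math.log(N / M)
    print(f"N={N:.0e} M={M}: sequential > {seq:,.0f}, map-reduce > {mr:,.0f}")

Both cases respect M < sqrt(N), and the map-reduce bound comes out
hundreds to thousands of times larger than the sequential one.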

Thus, in common cases, the map-reduce sketch is more detailed in both
senses (smaller threshold, more sketch centroids).  This should give
theoretical results that are at least as good.
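
To make the two-stage shape concrete, here is a toy Python sketch.  This
is not the Mahout code: the function names, the beta growth factor, and
the plain weighted Lloyd's step standing in for the final ball k-means
are all illustrative.

import math
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def _assign(sketch, p, w, threshold):
    # Core rule: a point founds a new centroid with probability
    # proportional to (weight * squared distance) / threshold,
    # otherwise it folds into its nearest centroid.
    if sketch:
        d2, j = min((dist2(p, c), j) for j, (c, _) in enumerate(sketch))
        if random.random() >= min(1.0, w * d2 / threshold):
            c, cw = sketch[j]
            tw = cw + w
            sketch[j] = [[ci + (pi - ci) * (w / tw)
                          for ci, pi in zip(c, p)], tw]
            return sketch
    sketch.append([list(p), w])
    return sketch

def streaming_sketch(points, k, beta=1.5):
    # Map side: one pass over a partition, keeping ~k log n weighted
    # centroids.  When the sketch outgrows that limit, the threshold
    # is raised and the sketch is collapsed onto itself by the same rule.
    limit = int(k * math.log(max(len(points), 2))) + 1
    threshold = 1e-9          # tiny at first, so early points spread out
    sketch = []               # list of [centroid, weight]
    for p in points:
        sketch = _assign(sketch, p, 1.0, threshold)
        while len(sketch) > limit:
            threshold *= beta
            old, sketch = sketch, []
            random.shuffle(old)
            for c, w in old:
                sketch = _assign(sketch, c, w, threshold)
    return sketch

def weighted_lloyd(weighted, k, iters=20):
    # Reduce side: weighted k-means on the union of all partition
    # sketches.  Random init here; real code would do better.
    pts = [c for c, _ in weighted]
    ws = [w for _, w in weighted]
    dim = len(pts[0])
    centers = random.sample(pts, k)
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(k)]
        wsum = [0.0] * k
        for p, w in zip(pts, ws):
            _, j = min((dist2(p, c), j) for j, c in enumerate(centers))
            wsum[j] += w
            for d in range(dim):
                sums[j][d] += w * p[d]
        centers = [[s / wv for s in sv] if wv else cv
                   for sv, wv, cv in zip(sums, wsum, centers)]
    return centers

if __name__ == "__main__":
    random.seed(0)
    data = [(random.gauss(cx, 0.1), random.gauss(cy, 0.1))
            for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(500)]
    random.shuffle(data)
    parts = [data[i::3] for i in range(3)]                 # 3 "mappers"
    sketches = [streaming_sketch(p, k=3) for p in parts]   # map side
    union = [cw for s in sketches for cw in s]             # combine
    print(weighted_lloyd(union, k=3))                      # reduce side

Because each mapper starts from a fresh, low threshold, this mirrors the
ordering argument above: no partition ever sketches more coarsely than it
would in a single sequential pass.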


On Thu, Mar 28, 2013 at 10:30 PM, Andy Twigg <andy.twigg@gmail.com> wrote:

> Dan/Ted:
>
> I like that you are implementing streaming k-means.
>
> Are there any results comparing it to mini-batch k-means ([1] and the
> paper cited therein)?
>
> In the distributed implementation, you independently compute a
> O(k)-means clustering on each partition, then combine them into a
> final k-means. Are there any guarantees/results about the accuracy of
> this? Clearly this sort of design also favours a storm/spark
> implementation - have you considered that?
>
> -Andy
>
>
>
> [1]
> http://scikit-learn.org/dev/modules/generated/sklearn.cluster.MiniBatchKMeans.html
>
