mahout-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ted Dunning <>
Subject Re: Streaming KMeans distance cutoff
Date Wed, 08 May 2013 16:48:52 GMT

On Wed, May 8, 2013 at 8:49 AM, Andy Twigg <> wrote:

> both of those make sense to me.
> On 8 May 2013 16:45, Dan Filimon <> wrote:
>> Hi Ted!
>> I recently talked to one of the authors of streaming k-means, Adam
>> Meyerson
>> asking about the distance cutoff as I wasn't sure of a right value for
>> this.
>> He told me two things:
>> - that we should multiply the distance / distanceCutoff ratio by the
>> weight
>> of the point we're trying to cluster so as to avoid collapsing larger
>> clusters
This makes sense, but I have no idea what the effect will be.

>  - the initial cutoff they use is 1 / numClusters basically
This is just wrong.

It is fine in theoretical settings with synthetic clusters of unit
characteristic scale, but is not invariant over uniform scaling so it can't
be correct.

>> As I tested the code on multiple well known data sets, this got me
>> thinking
>> of removing the distanceCutoff all together.
>> It seems like just another parameter to get right with only limited real
>> value of fiddling with it.
I think it is core to the algorithm.  It is adapted to the data in any

>> Additionally, clusterOvershoot, the thing we're using to delay distance
>> cutoff increases also seems somewhat unnecessary. Why use it and get a lot
>> more centroids than what we asked for.
Well, it needs to be there until we get more run-time.  Getting more
centroids doesn't hurt and getting too few definitely does hurt.  Thus we
need some insurance.

>> I want to post a final version for review, but I just wanted to mention
>> these two things.
>> It's not like they "hurt" really, they just don't seem to be helping too
>> much and I'd rather have something that more closely matches the
>> theoretical guarantees in the paper.
>> What do you think?
I think we already do better than some of the guarantees in the paper.

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message