mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Benson Margulies <>
Subject Re: Validating clustering output
Date Tue, 28 Jul 2009 01:51:49 GMT
Brown and DiPietro's algorithm for clustering based on entropy is
somewhat infamous for the difficulty of achieving usable performance.
Mike Collins was responsible for a famously speedy version. Having
build one that is just barely fast enough in C++, I wouldn't recommend
trying it in Java. Of course, you aren't proposing that, just
recommending the bigram entropy metric or something like it.

On Mon, Jul 27, 2009 at 9:42 PM, Ted Dunning<> wrote:
> (vastly delayed response ... huge distractions competing with more than 2
> minutes answers are to blame)
> Grant,
> For evaluating clustering for symbol sequences:
> Most of the other references I have found talk about quality relative to
> gold standard judgments about whether exemplars are in the same class or
> relative to similarity/distinctiveness ratios.  Neither is all that
> satisfactory.
> My preference is an entropic measure that describes how much of the
> information in your data is captured by the clustering vs how much residual
> info there is.
> The other reference I am looking for may be in David Mackay's book.  The
> idea is that you measure the quality of the approximation by looking at the
> entropy in the cluster assignment relative to the residual required to
> precisely specify the original data relative to the quantized value.
> This is also related to trading off signal/noise in a vector quantizer.
> David,  do you have a moment to talk about this with me?  I can't free up
> the time to chase these final references and come up with a nice formula for
> this.  I think you could do it in 10-20 minutes.
> On Tue, Jul 14, 2009 at 6:41 AM, Grant Ingersoll <>wrote:
>> On Jun 17, 2009, at 2:51 AM, Ted Dunning wrote:
>>  A principled approach to cluster evaluation is to measure how well the
>>> cluster membership captures the structure of unseen data.  A natural
>>> measure
>>> for this is to measure how much of the entropy of the data is captured by
>>> cluster membership.  For k-means and its natural L_2 metric, the natural
>>> cluster quality metric is the squared distance from the nearest centroid
>>> adjusted by the log_2 of the number of clusters.  This can be compared to
>>> the squared magnitude of the original data or the squared deviation from
>>> the
>>> centroid for all of the data.  The idea is that you are changing the
>>> representation of the data by allocating some of the bits in your original
>>> representation to represent which cluster each point is in.  If those bits
>>> aren't made up by the residue being small then your clustering is making a
>>> bad trade-off.
>>> In the past, I have used other more heuristic measures as well.  One of
>>> the
>>> key characteristics that I would like to see out of a clustering is a
>>> degree
>>> of stability.  Thus, I look at the fractions of points that are assigned
>>> to
>>> each cluster or the distribution of distances from the cluster centroid.
>>> These values should be relatively stable when applied to held-out data.
>>> For text, you can actually compute perplexity which measures how well
>>> cluster membership predicts what words are used.  This is nice because you
>>> don't have to worry about the entropy of real valued numbers.
>> Do you have any references on any of the above approaches?
> --
> Ted Dunning, CTO
> DeepDyve
> 111 West Evelyn Ave. Ste. 202
> Sunnyvale, CA 94086
> 858-414-0013 (m)
> 408-773-0220 (fax)

View raw message