mahout-user mailing list archives

From Pat Ferrel <...@occamsmachete.com>
Subject Re: Interpreting Cluster Dump Metrics
Date Fri, 24 May 2013 16:46:50 GMT
I'm trying to automate something like hierarchical clustering, so I'm looking for a good quality
metric. I can see no way to automate the choice of k from the numbers I just got, but it's a start.
The run was on a very small data set.

You mention looking at intra-cluster average distance with held out data. Held-out, I assume,
means it was not used to calculate centroids or in determining cluster membership. Are you
proposing remeasuring the average distance from the closest centroid for these held-out docs?
Averaging together the ones that are closest to the same centroid, then averaging the averages
for all clusters?

I don't think I've heard of this before. It seems interesting; is there a paper?
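
To make the question concrete, here is a minimal sketch of the held-out measurement as I read it. This is plain Python, not Mahout code; the function names are hypothetical, and Euclidean distance is an assumption (any distance measure could be swapped in). It returns both the "average of the per-cluster averages" and the plain average over all held-out points, since the two differ when clusters have unequal sizes.

```python
import math
from collections import defaultdict


def nearest_centroid(point, centroids):
    """Return (index, distance) of the centroid closest to `point`.

    Euclidean distance is an assumption made for this sketch.
    """
    dists = [math.dist(point, c) for c in centroids]
    idx = min(range(len(centroids)), key=dists.__getitem__)
    return idx, dists[idx]


def held_out_distances(held_out, centroids):
    """Measure held-out points against centroids fitted on other data.

    Returns (cluster_means, mean_of_means, overall_mean):
      cluster_means  average distance per centroid, over the held-out
                     points closest to that centroid
      mean_of_means  unweighted average of those per-cluster averages
      overall_mean   average distance over all held-out points
    """
    per_cluster = defaultdict(list)
    for p in held_out:
        idx, d = nearest_centroid(p, centroids)
        per_cluster[idx].append(d)

    # Centroids that attract no held-out points are simply absent here.
    cluster_means = {i: sum(ds) / len(ds) for i, ds in per_cluster.items()}
    mean_of_means = sum(cluster_means.values()) / len(cluster_means)

    total = sum(sum(ds) for ds in per_cluster.values())
    count = sum(len(ds) for ds in per_cluster.values())
    overall_mean = total / count
    return cluster_means, mean_of_means, overall_mean
```

The mean of means weights every cluster equally, while the overall mean weights every document equally; which of the two is intended is part of what I'm asking.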

On May 21, 2013, at 9:53 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

On Tue, May 21, 2013 at 8:47 PM, Pat Ferrel <pat.ferrel@gmail.com> wrote:

> For this sample it looks like about 20-40 clusters is "best"? Looking at
> the results for k=40 by eyeball they do seem pretty good.


It is really hard to tell with these numbers.  In spite of their heritage,
these scaled average distances are kind of debatable as things to compare,
if only because they are scaled differently.

My own tendency is to prefer to use unscaled intra-cluster average
distance.  This should monotonically decrease as k increases.  The
interesting question (for me) is what the same average is for held-out data.

This measure of quality is focused around the use of clustering as a
feature for downstream modeling, not necessarily for human consumption.
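
A rough sketch of the comparison described above: fit k-means on training data for several values of k and report the unscaled average distance to the nearest centroid for both the training data and the held-out data. This is illustrative plain Python, not Mahout code. The function names, the naive "first k points" initialization, the fixed iteration count, and the use of Euclidean distance are all assumptions made to keep the example short and deterministic.

```python
import math


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, iterations=20):
    """Plain Lloyd iterations with a naive deterministic initialization.

    The first k points are used as starting centroids (an assumption of
    this sketch; a real run would use a better seeding scheme).
    """
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Move each centroid to the mean of its members, if it has any.
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                centroids[i] = mean_point(members)
    return centroids


def avg_distance_to_nearest(points, centroids):
    """Unscaled average distance from each point to its nearest centroid."""
    return sum(min(math.dist(p, c) for c in centroids) for p in points) / len(points)


def sweep(train, held_out, ks):
    """Return (k, training average, held-out average) for each k."""
    rows = []
    for k in ks:
        centroids = kmeans(train, k)
        rows.append((
            k,
            avg_distance_to_nearest(train, centroids),
            avg_distance_to_nearest(held_out, centroids),
        ))
    return rows
```

On a toy set of four training points forming two tight pairs, with held-out points placed between the members of each pair, the training average keeps dropping as k goes from 1 to 2 to 4, while the held-out average drops at k=2 and rises again at k=4. Note that Lloyd iterations only find a local optimum, so the training curve is not guaranteed to decrease for every initialization.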

