mahout-user mailing list archives

From Sean Owen <sro...@gmail.com>
Subject Re: Vector distance within a cluster
Date Wed, 27 Feb 2013 16:05:13 GMT
A common measure of cluster coherence is the mean distance or mean squared
difference between the members and the cluster centroid. It sounds like
this is the kind of thing you're measuring with these all-pairs distances.
That could be a measure too; I've usually seen that done by taking the
maximum such intracluster distance, the 'diameter'.
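
As a rough sketch of those three measures (plain Python, not Mahout's API;
the function names and the toy 2-D cluster are made up for illustration, and
Euclidean distance is assumed as the default metric):

```python
import math
from itertools import combinations


def euclidean(a, b):
    """L2 distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Component-wise mean of the points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def coherence(points, dist=euclidean):
    """Return (mean distance to centroid, mean squared distance to centroid,
    diameter), where diameter is the largest pairwise distance in the cluster."""
    c = centroid(points)
    d = [dist(p, c) for p in points]
    mean_dist = sum(d) / len(d)
    mean_sq = sum(x * x for x in d) / len(d)
    diameter = max((dist(a, b) for a, b in combinations(points, 2)), default=0.0)
    return mean_dist, mean_sq, diameter


# Hypothetical toy cluster: the four corners of a 2x2 square.
cluster = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
mean_dist, mean_sq, diameter = coherence(cluster)
print(mean_dist, mean_sq, diameter)
```

The two centroid-based numbers cost one pass over the members, while the
diameter needs all pairs, so it gets expensive for large clusters.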

To answer Ted's question -- you're measuring internal consistency. You're
not trying to find clusters that match some external standard that says
these 100 docs should cluster together, etc.

I'm speaking off the cuff, but I think the idea was that L1/Manhattan
distance may give you clusters that tend to spread out over fewer rather than
more dimensions, and so that may make them more interpretable -- because
they will tend to be nearly identical in the other several dimensions and
those homogeneous dimensions tell you what they're "about".

The reason is that L1 is "indifferent" across dimensions -- moving a unit
in any dimension makes you a unit further from or closer to another point --
while in L2 moving along a dimension where you are already close does
little.
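
A small numeric illustration of that difference (plain Python; the points
and variable names are invented for the example): start 5 units from a
reference point along one dimension, then move one unit either along that
same dimension or along a dimension where the two points currently agree.

```python
import math


def manhattan(a, b):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """L2 distance: square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


origin = [0.0, 0.0]
p = [5.0, 0.0]                # far from origin in dim 0, identical in dim 1
moved_far_dim = [6.0, 0.0]    # one unit further along the already-far dimension
moved_close_dim = [5.0, 1.0]  # one unit along the dimension where they agreed

# Change in distance to origin caused by each one-unit move.
l1_far = manhattan(moved_far_dim, origin) - manhattan(p, origin)
l1_close = manhattan(moved_close_dim, origin) - manhattan(p, origin)
l2_far = euclidean(moved_far_dim, origin) - euclidean(p, origin)
l2_close = euclidean(moved_close_dim, origin) - euclidean(p, origin)

print("L1:", l1_far, l1_close)  # both moves cost a full unit
print("L2:", l2_far, l2_close)  # the move in the close dimension costs far less
```

Under L1 both moves change the distance by exactly one unit; under L2 the
move in the dimension where the points agreed adds only about a tenth of a
unit, since sqrt(26) is only slightly more than 5.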

On Wed, Feb 27, 2013 at 3:23 PM, Chris Harrington <chris@heystaks.com> wrote:

> Hmmm, you may have to dumb things down for me here. I don't have much
> of a background in the area of ML and I'm just piecing things together and
> learning as I go.
> So I don't really understand what you mean by "Coherence against an
> external standard?  Or internal consistency/homogeneity?" or "One thought
> along these lines is to add L_1 regularization to the k-means algorithm."
> Is L_1 regularization the same as manhattan distance?
>
> That aside I'm outputting a file with the top terms and the text of 20
> random documents that ended up in that cluster and eyeballing that, not
> very high-tech or efficient but it was the only way I knew to make a
> relevance judgment on a cluster topic. For example, if the majority of the
> samples are sport related and 82.6% of the vector distances in my cluster
> are quite similar I'm happy to call that cluster sport.
>
