mahout-user mailing list archives

From Chris Harrington <>
Subject Re: Vector distance within a cluster
Date Wed, 27 Feb 2013 15:23:07 GMT
Hmmm, you may have to dumb things down for me here. I don't have much of a background
in the area of ML and I'm just piecing things together and learning as I go.
So I don't really understand what you mean by "Coherence against an external standard?  Or
internal consistency/homogeneity?" or "One thought along these lines is to add L_1 regularization
to the k-means algorithm."
Is L_1 regularization the same as Manhattan distance?

That aside, I'm outputting a file with the top terms and the text of 20 random documents that
ended up in that cluster and eyeballing it. Not very high-tech or efficient, but it was the
only way I knew to make a relevance judgment on a cluster topic. For example, if the majority
of the samples are sport-related and 82.6% of the vector distances in my cluster are quite
similar, I'm happy to call that cluster sport.
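(The sampling half of that eyeballing step is easy to script. Below is a minimal Python sketch; the function name and the assignment/text data layout are illustrative, not Mahout's actual output format:)

```python
import random
from collections import defaultdict

def sample_for_eyeballing(assignments, texts, k=20, seed=42):
    """Group documents by cluster id and draw up to k random samples
    from each cluster for manual inspection.

    assignments: iterable of (cluster_id, doc_id) pairs
    texts:       mapping doc_id -> document text
    """
    rng = random.Random(seed)  # fixed seed so reruns show the same sample
    by_cluster = defaultdict(list)
    for cluster_id, doc_id in assignments:
        by_cluster[cluster_id].append(doc_id)
    return {c: [texts[d] for d in rng.sample(ids, min(k, len(ids)))]
            for c, ids in by_cluster.items()}
```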

On 26 Feb 2013, at 22:00, Ted Dunning wrote:

> Chris,
> How are you doing your manual judgement step?  Coherence against an
> external standard?  Or internal consistency/homogeneity?
> Except for unusual situations it is to be expected that most clusterings
> are not particularly stable (i.e. will not reproduce the same clusters from
> run to run).  As such, it is also unlikely that they will reproduce
> externally defined clusters any more than they will reproduce their own
> results.
> Likewise, there is no guarantee that the results will be easily
> interpretable.  One thought along these lines is to add L_1 regularization
> to the k-means algorithm.  Another is to look into what the carrot project
> has done where, according to the developers, they have put some effort into
> making clusters that are easily summarizable.  This might be similar in
> effect to the regularization step I just mentioned.
> On Tue, Feb 26, 2013 at 7:02 AM, Chris Harrington <> wrote:
>> Well, what I'm trying to do is create clusters of topically similar
>> content via kmeans.
>> Since I'm basing validity on topics there's a manual judgement step.
>> And that manual step is taking a prohibitive amount of time to check many
>> clustering runs hence the desire for some stats to indicate roughly how
>> good the clusters are.
>> So I want some stats from which, at a glance, I can tell which
>> clusters "should" be good, and manually check those instead of having to
>> check each and every one.
>> I was thinking that a file with
>> 1. the number of clusters,
>> 2. the avg distance from every point to every other point,
>> 3. the avg distance from the points furthest from the center to all other
>> points (furthest 25% of all points within a cluster),
>> 4. the avg distance from the points closest to the center to all other points
>> (closest 25% of all points within a cluster)
>> would allow me to quickly see if I should even bother manually checking
>> the clustering output, the logic being that if 4, 3 and 2 are similar in
>> value then it's probably a decent cluster and I can manually check it. Also
>> a comparison of 3 vs 2 would indicate whether the cluster contains a number
>> of distant outliers, and 4 vs 2 would show roughly how dense the cluster
>> is.
>> This makes sense, right? Or am I barking up the wrong tree?
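(The per-cluster stats in the outline above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not Mahout code: it uses plain Euclidean distance and O(n^2) pairwise distances, and the function name is mine.)

```python
import numpy as np

def cluster_stats(points, centroid):
    """Stats 2-4 from the outline, for one cluster.

    points:   (n, d) array of the cluster's member vectors
    centroid: (d,) array, the cluster center
    """
    n = len(points)
    # Full (n, n) matrix of pairwise Euclidean distances.
    diffs = points[:, None, :] - points[None, :, :]
    pair = np.sqrt((diffs ** 2).sum(axis=-1))
    # 2. avg distance from every point to every other point
    #    (diagonal is zero, so divide by n*(n-1), not n*n).
    avg_all = pair.sum() / (n * (n - 1))
    # Rank points by distance from the centroid to find the
    # closest and furthest quartiles.
    to_centroid = np.linalg.norm(points - centroid, axis=1)
    order = np.argsort(to_centroid)
    q = max(1, n // 4)
    near, far = order[:q], order[-q:]
    # 3./4. avg distance from the far/near quartile to all other points.
    avg_far = pair[far].sum() / (len(far) * (n - 1))
    avg_near = pair[near].sum() / (len(near) * (n - 1))
    return avg_all, avg_far, avg_near
```

If avg_far is much larger than avg_all, the cluster likely has distant outliers; if all three are close, the cluster is roughly uniform in density.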
>> On 25 Feb 2013, at 20:15, Ted Dunning wrote:
>>> The best way to evaluate a cluster really depends on what your purpose is.
>>> My own purpose is typically to use the clustering as a description of the
>>> probability distribution of data.
>>> For that purpose, the best evaluation is distance to centroids for held-out
>>> data.  The use of held-out data is critical here since otherwise you could
>>> just put a single cluster at every data point and get zero distance for the
>>> original data.  For held-out data, of course, the story would be different.
>>> This view of things is very good from the standpoint of machine learning
>>> and data compression, but might be less useful for certain purposes that
>>> have to do with explanation of data in human-readable form.  My experience
>>> is that it is common for a clustering algorithm to be very good as a
>>> probability distribution description but quite bad for human inspection.
>>> My own tendency would be to adapt the outline you gave to work on held-out
>>> data instead of the original training data.
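(The held-out evaluation described above boils down to: fit centroids on training data, then average each held-out point's distance to its nearest centroid. A minimal NumPy sketch, not Mahout code; the function name is mine:)

```python
import numpy as np

def heldout_score(centroids, heldout):
    """Average distance from each held-out point to its nearest centroid.

    centroids: (k, d) array fit on training data
    heldout:   (m, d) array of points NOT used in training

    Lower is better: it means the centroids describe unseen data well,
    without the degenerate "one cluster per training point" loophole.
    """
    # (m, k) matrix of distances from each held-out point to each centroid.
    d = np.linalg.norm(heldout[:, None, :] - centroids[None, :, :], axis=-1)
    return d.min(axis=1).mean()
```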
>>> On Mon, Feb 25, 2013 at 4:27 AM, Chris Harrington <> wrote:
>>>> Hi all,
>>>> I want to find all the vectors within a cluster and then find the distance
>>>> between them and every other vector within the cluster, in the hope that
>>>> this will give me a good idea of how similar the vectors within a cluster
>>>> are, as well as identify outlier vectors.
>>>> So there are 2 things I want to ask.
>>>> 1. Is this a sensible approach to evaluating the cluster quality?
>>>> 2. Is the clusteredPoints/parts-m-00000 file the correct place to get
>>>> this info from?
>>>> Thanks,
>>>> Chris
