mahout-user mailing list archives

From Ted Dunning
Subject Re: String clustering and other newbie questions
Date Tue, 01 Sep 2009 16:11:02 GMT
The k-means implementation has the idea of distance between vectors of real
numbers pretty deeply baked into it.  One example of this is that it assumes
that you can take the average (aka centroid) of a set of examples.  Taking
the average of a set of strings in the sense of Levenshtein distance would be
difficult.

There is an alternative algorithm called k-medoids which uses one of the
input samples as the centroid, but I would expect that this would give poor
results with Levenshtein distance.
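
To make the medoid idea concrete, here is a minimal Python sketch.  It is
not Mahout code; the function names are made up for illustration.  It picks
the medoid of a set of strings as the member with the smallest total edit
distance to all the others, which is the step k-medoids uses in place of
averaging.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def medoid(strings):
    """Return the input string with the smallest summed distance to the rest."""
    return min(strings, key=lambda s: sum(levenshtein(s, t) for t in strings))


if __name__ == "__main__":
    print(medoid(["cart", "cat", "car", "card", "dog"]))
```

Note that finding a medoid this way costs a quadratic number of distance
computations per cluster, and the "center" can only ever be one of the
input strings.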

It would, however, be very reasonable to use bigrams or trigrams as labels on
vector coordinates.  The vector value of a string would be derived by
weighting each bigram or trigram according to the negative log of the
prevalence of that bigram or trigram in your entire corpus.  This
representation would be highly amenable to k-means clustering.  Results
should be relatively good, although inspection of the centroids is likely to
be a bit confusing.
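
As a rough illustration of that representation, here is a Python sketch
under a few assumptions that are not part of the original suggestion:
n-grams are taken over lowercased characters with one space of padding at
each end, "prevalence" is read as the n-gram's share of all n-gram
occurrences in the corpus, and repeated n-grams in a string add up.  The
function names are hypothetical and this is not Mahout's API.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Character n-grams of the lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def ngram_weights(corpus, n=3):
    """Weight each n-gram by -log of its share of all corpus n-grams."""
    counts = Counter(g for s in corpus for g in char_ngrams(s, n))
    total = sum(counts.values())
    return {g: -math.log(c / total) for g, c in counts.items()}


def vectorize(text, weights, n=3):
    """Sparse vector: n-gram label -> summed weight over its occurrences."""
    vec = Counter()
    for g in char_ngrams(text, n):
        if g in weights:  # n-grams unseen in the corpus are skipped (assumption)
            vec[g] += weights[g]
    return dict(vec)


if __name__ == "__main__":
    corpus = ["clustering", "cluster", "classification"]
    weights = ngram_weights(corpus)
    print(vectorize("cluster", weights))
```

Rare n-grams get large weights and common ones get small weights.  The
resulting sparse vectors can be averaged coordinate by coordinate, which is
what k-means needs.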

On Tue, Sep 1, 2009 at 5:06 AM, Juan Francisco Contreras Gaitan wrote:

> But if I understood you well, and as far as I know, Mahout has its own
> k-means implementation. Then, could I use it for my purposes instead of a
> DP-like setup?

Ted Dunning, CTO
