mahout-user mailing list archives

From Sean Owen <>
Subject Re: String clustering and other newbie questions
Date Tue, 01 Sep 2009 16:44:33 GMT
If I may attempt to clarify: indeed, it makes no sense to have a vector
whose elements are 'string valued', nor can I think of any mapping from
strings to doubles that would be of any use here.

What he is really after is clustering strings like they are vectors
themselves, not elements of another vector. The question is, how much do we
need to be able to think of strings like vectors to make the algorithm work?

We need a distance metric and he's suggesting Levenshtein, which seems OK at
first glance. (It satisfies the triangle inequality, so it is a proper metric.)
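
To make the metric concrete, here is a minimal Python sketch of Levenshtein
distance, plus a brute-force spot check of the triangle inequality on a handful
of words. This is illustration only, not Mahout code; the function names are
made up for this example.

```python
from itertools import permutations


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the shorter string inside
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def satisfies_triangle_inequality(words) -> bool:
    """Check d(x, z) <= d(x, y) + d(y, z) for every ordered triple."""
    return all(
        levenshtein(x, z) <= levenshtein(x, y) + levenshtein(y, z)
        for x, y, z in permutations(words, 3)
    )


print(levenshtein("kitten", "sitting"))  # 3
```

The spot check proves nothing by itself, of course, but it is a cheap sanity
test for any custom distance you plug in.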

A centroid would then just be a string that is a similar (small) number of
edits away from every string in its cluster.
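
One simple way to get such a "centroid" is to pick the cluster member with the
smallest total edit distance to the others, i.e. a medoid rather than a true
mean. The sketch below assumes that substitution; the names are hypothetical,
and the compact recursive Levenshtein is only suitable for short strings.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def levenshtein(a: str, b: str) -> int:
    """Compact recursive edit distance; fine for short strings only."""
    if not a or not b:
        return len(a) + len(b)
    return min(
        levenshtein(a[1:], b) + 1,                    # delete a[0]
        levenshtein(a, b[1:]) + 1,                    # insert b[0]
        levenshtein(a[1:], b[1:]) + (a[0] != b[0]),   # substitute or match
    )


def medoid(strings, dist=levenshtein):
    """Return the member whose summed distance to all members is smallest.
    Ties go to the earliest candidate in the input order."""
    return min(strings, key=lambda c: sum(dist(c, s) for s in strings))


print(medoid(["cat", "cart", "car", "bat"]))  # cat
```

Because the medoid is always one of the input strings, nothing ever has to be
"averaged", which sidesteps the question of what a mean of strings would be.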

Distances are discrete; does that matter, though?

Anything else that doesn't map? I haven't thought about it a lot, but I don't
yet see why k-means couldn't let you cluster strings. In the CF (collaborative
filtering) code I do something similar for arbitrary 'items', so that hints to
me that a well-behaved distance metric is all you need?
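
To illustrate the "a distance metric is all you need" idea, here is a rough
k-medoids-style loop in Python that never looks inside the items and only calls
a distance function. It is a sketch under assumptions, not Mahout's k-means:
the names are invented, initialization is just the first k items, and the items
are assumed to be distinct.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def levenshtein(a: str, b: str) -> int:
    """Compact recursive edit distance; fine for short strings only."""
    if not a or not b:
        return len(a) + len(b)
    return min(
        levenshtein(a[1:], b) + 1,
        levenshtein(a, b[1:]) + 1,
        levenshtein(a[1:], b[1:]) + (a[0] != b[0]),
    )


def assign(items, medoids, dist):
    """Map each medoid to the list of items nearest to it."""
    clusters = {m: [] for m in medoids}
    for item in items:
        nearest = min(medoids, key=lambda m: dist(item, m))
        clusters[nearest].append(item)
    return clusters


def k_medoids(items, k, dist, max_iter=20):
    """Cluster arbitrary items using only a pairwise distance function."""
    medoids = list(items[:k])  # naive init (assumption): first k distinct items
    for _ in range(max_iter):
        clusters = assign(items, medoids, dist)
        # New medoid per cluster: the member with the smallest summed distance.
        updated = [
            min(members, key=lambda c: sum(dist(c, s) for s in members))
            for members in clusters.values()
        ]
        if updated == medoids:  # converged
            break
        medoids = updated
    return assign(items, medoids, dist)


words = ["apple", "banana", "ample", "bandana", "apply", "bananas"]
for centre, members in k_medoids(words, 2, levenshtein).items():
    print(centre, members)
```

Swapping in a different distance function is the only change needed to cluster
some other kind of item.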

Of course, the code wouldn't work as-is for this. One would probably need to
modify it a lot.

For what it is worth... you could actually get the TreeClusteringRecommender
class to cluster your strings with just a little work. I am not sure if it
implements the algorithm you want. It is also not distributed.


On Sep 1, 2009 5:14 PM, "Ted Dunning" <> wrote:

That particular trick wouldn't work because you are losing the essence of
real numbers with this step.  If 1.0 refers to one string and 2.0 refers to
another, what does 1.5 refer to?

Better to use trigrams as the labels for the coordinates and weight them by
inverse document frequency.
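
A rough Python sketch of the trigram idea, for illustration: each string
becomes a sparse vector keyed by character trigram and weighted by inverse
document frequency, so ordinary vector distances (cosine here) apply. The
boundary markers, the plain log(N/df) weighting, and all function names are
assumptions of this example, not anything Mahout prescribes.

```python
import math
from collections import Counter


def trigrams(s: str):
    """Character trigrams of s, with ^ and $ marking the string boundaries."""
    padded = f"^{s.lower()}$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def idf_weights(strings):
    """IDF per trigram: log(N / number of strings containing it)."""
    n = len(strings)
    df = Counter(t for s in strings for t in set(trigrams(s)))
    return {t: math.log(n / count) for t, count in df.items()}


def vectorize(s: str, idf):
    """Sparse vector {trigram: count * idf}; unseen trigrams are dropped."""
    counts = Counter(trigrams(s))
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine_distance(u, v) -> float:
    """1 - cosine similarity of two sparse vectors (1.0 if either is zero)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return 1.0 - dot / norm if norm else 1.0


corpus = ["mahout", "mahouts", "hadoop"]
idf = idf_weights(corpus)
vecs = {s: vectorize(s, idf) for s in corpus}
print(cosine_distance(vecs["mahout"], vecs["mahouts"]))
print(cosine_distance(vecs["mahout"], vecs["hadoop"]))
```

Unlike the string-to-double map, every coordinate here is a genuine real
number, so averaging two vectors gives something meaningful.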

On Tue, Sep 1, 2009 at 6:28 AM, Juan Francisco Contreras Gaitan <> wrote:

> ... I could use a Map between doubles and strings: storing doubles in
> the algorithm, and retrieving the strings to compute distance in measuring
> steps.
