mahout-user mailing list archives

From Ted Dunning <>
Subject Re: Methods for Naming Clusters
Date Wed, 12 Aug 2009 17:53:10 GMT
On Wed, Aug 12, 2009 at 6:12 AM, Shashikant Kore <> wrote:

> Is this a necessary & sufficient  condition for a good cluster label?

I am not entirely clear what "this" refers to.  My assertion is that a high
LLR (log-likelihood ratio) score is sufficient evidence to use the term or
phrase as a label.  I generally also limit the number of terms, taking only
the highest scoring ones.  The phrase "necessary and sufficient" comes from
rigorous mathematics and doesn't entirely apply here, where we are talking
about heuristics.
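
As a rough illustration of that heuristic, here is a minimal sketch in Python.
It assumes document counts per term inside and outside the cluster are already
available; the names llr and top_labels, and the limit of 5, are made up for
the example.  It also skips terms that are under-represented in the cluster,
since the LLR score is large for those as well.

```python
import math


def x_log_x(x):
    # Convention: 0 * log(0) = 0.
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized Shannon entropy of a set of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: docs in the cluster containing the term
    k12: docs in the cluster without the term
    k21: docs outside the cluster containing the term
    k22: docs outside the cluster without the term
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


def top_labels(in_cluster, outside, n_in, n_out, limit=5):
    """Return up to `limit` terms with the highest LLR score.

    in_cluster / outside map term -> number of docs containing it;
    n_in / n_out are the total doc counts inside / outside the cluster.
    """
    scored = []
    for term, k11 in in_cluster.items():
        k21 = outside.get(term, 0)
        # Keep only terms that are over-represented in the cluster.
        if k11 * n_out <= k21 * n_in:
            continue
        scored.append((llr(k11, n_in - k11, k21, n_out - k21), term))
    scored.sort(reverse=True)
    return [term for _, term in scored[:limit]]
```

A term that appears in every document, inside and outside the cluster alike,
scores zero and is filtered out, while a term concentrated in the cluster
scores high.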

> On a different note,  is there any way to identify relationship among
> the top labels of the clusters? For example, if I have cluster related
> automobiles, I may get the companies (GM, Ford, Toyota) along with
> their popular models (Corolla, Cadillac) as top labels. How can I
> figure out Toyota and Corolla are strongly related?

Look at the co-occurrence statistics of the terms themselves.  Use that to
form a sparse graph.  Then do spectral clustering or agglomerative
clustering on the graph.

That will give you clusters of related terms, which is much of what you
seek.  Of course, because these terms are all being used to describe the
same cluster, there is a good chance that you will just replicate the label
sets of your clusters.
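
A minimal sketch of the graph approach, in Python.  Several things here are
assumptions made for the example: Jaccard similarity of the document sets
stands in for the co-occurrence statistic (an LLR score on the pair counts
would also work), the 0.5 cutoff is arbitrary, and single-link agglomerative
clustering is used because it reduces to finding connected components of the
thresholded graph.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_graph(docs, min_sim=0.5):
    """Build a sparse term graph from docs (each doc is a list of terms).

    Returns (edges, terms), where edges maps (a, b) -> similarity and only
    pairs with similarity >= min_sim are kept.
    """
    df = defaultdict(int)    # term -> number of docs containing it
    pair = defaultdict(int)  # (a, b) -> number of docs containing both
    for doc in docs:
        terms = sorted(set(doc))
        for t in terms:
            df[t] += 1
        for a, b in combinations(terms, 2):
            pair[(a, b)] += 1

    edges = {}
    for (a, b), n_ab in pair.items():
        sim = n_ab / (df[a] + df[b] - n_ab)  # Jaccard similarity
        if sim >= min_sim:
            edges[(a, b)] = sim
    return edges, set(df)


def single_link_clusters(edges, terms):
    """Single-link agglomerative clustering cut at the edge threshold.

    Merging along every retained edge yields the connected components
    of the sparse graph, implemented here with union-find.
    """
    parent = {t: t for t in terms}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for t in terms:
        groups[find(t)].add(t)
    return sorted(sorted(g) for g in groups.values())


docs = [
    ["toyota", "corolla"],
    ["toyota", "corolla", "ford"],
    ["ford", "mustang"],
    ["ford", "mustang"],
    ["toyota", "corolla"],
]
edges, terms = cooccurrence_graph(docs)
clusters = single_link_clusters(edges, terms)
```

On this toy data the weak toyota/ford and corolla/ford links fall below the
cutoff, so toyota groups with corolla and ford with mustang.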

Ted Dunning, CTO
