mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Methods for Naming Clusters
Date Sun, 03 Jan 2010 21:47:42 GMT
Just to add a little bit based on some research I've been doing on the subject, it seems there
are several techniques for naming clusters, ranging from the mundane to the intricate:

1. Top terms based on weight (e.g. TF-IDF) -- Implemented in Mahout in the ClusterDumper -
Just sort the terms by weight across the docs in the cluster and spit out the top few
2. Log-likelihood Ratio (LLR) - Implemented in Mahout in ClusterLabels, currently requires
a Lucene index, but could be modified - Calculates the log-likelihood ratio for the terms in
the vectors, then sorts on that value and spits out the top terms as labels
3. Some type of LSA/SVD approach - Implemented in Carrot2, others - Identify the concepts
by taking the SVD of the vectors and then determine/use the base concepts derived from that
4. Frequent Phrases using something like a suffix tree or other phrase detection methods -
Implemented in Carrot2 (Suffix Tree Clustering) and others - finds frequent phrases and
sorts them by weight to pick the ones to return
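
To make approach 1 concrete, here is a rough Python sketch of ranking a cluster's terms by summed TF-IDF weight. It is not the ClusterDumper code; the function name, the token-list input format, and the plain tf * log(N/df) weighting are all assumptions made for illustration.

```python
import math
from collections import Counter


def top_terms_by_tfidf(cluster_docs, all_docs, k=3):
    """Return the k terms with the highest summed TF-IDF weight in a cluster.

    Hypothetical helper: cluster_docs and all_docs are lists of token lists.
    all_docs is the whole collection (cluster included), so every cluster
    term has a document frequency of at least 1.
    """
    n_docs = len(all_docs)

    # Document frequency: number of docs in the collection containing the term.
    doc_freq = Counter()
    for doc in all_docs:
        doc_freq.update(set(doc))

    # Sum tf * idf for each term over the docs in the cluster.
    weights = Counter()
    for doc in cluster_docs:
        for term, tf in Counter(doc).items():
            weights[term] += tf * math.log(n_docs / doc_freq[term])

    # Highest weight first; ties are broken alphabetically for stable output.
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]
```

Terms that show up in nearly every document get a weight close to zero, so they drop out of the labels even when they are frequent inside the cluster.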

I'm probably missing some other approaches so feel free to fill in, but those are what I've
come across so far.
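
For approach 2, a minimal sketch of LLR-based labeling, again in Python and not taken from ClusterLabels. It assumes the standard G^2 form of Dunning's log-likelihood ratio over a 2x2 table of document counts (docs with/without the term, inside/outside the cluster); the function names and the rule of keeping only terms that are over-represented in the cluster are my own choices.

```python
import math
from collections import Counter


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: cluster docs with the term      k12: other docs with the term
    k21: cluster docs without the term   k22: other docs without the term
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)  # with term, without term
    cols = (k11 + k21, k12 + k22)  # in cluster, outside cluster
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    # Empty cells contribute nothing, which also avoids log(0).
    return 2.0 * sum(
        k * math.log(k * total / (rows[r] * cols[c]))
        for k, r, c in cells
        if k > 0
    )


def llr_labels(cluster_docs, other_docs, k=3):
    """Rank cluster terms by LLR against the rest of the collection.

    Hypothetical helper: both arguments are non-empty lists of token lists.
    """
    in_df = Counter()
    for doc in cluster_docs:
        in_df.update(set(doc))
    out_df = Counter()
    for doc in other_docs:
        out_df.update(set(doc))

    n_in, n_out = len(cluster_docs), len(other_docs)
    scores = {}
    for term, k11 in in_df.items():
        k12 = out_df[term]
        # LLR is symmetric, so keep only terms over-represented in the cluster.
        if k11 / n_in <= k12 / n_out:
            continue
        scores[term] = llr(k11, k12, n_in - k11, n_out - k12)

    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]
```

The over-representation check matters because a term that is unusually rare in the cluster scores just as high as one that is unusually common, and only the latter makes a sensible label.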

-Grant

On Jan 3, 2010, at 3:07 PM, Ted Dunning wrote:

> Good thing to do.
> 
> Slightly tricky to do.  But worthy.
> 
> On Sun, Jan 3, 2010 at 11:04 AM, Grant Ingersoll <gsingers@apache.org> wrote:
> 
>> My first thought is to just create an n-gram model of the same field I'm
>> clustering on (as that will allow the existing code to work unmodified), but
>> I wanted to hear what others think.  Is it worth the time?
>> 
> 
> 
> 
> -- 
> Ted Dunning, CTO
> DeepDyve


