mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Methods for Naming Clusters
Date Sun, 03 Jan 2010 19:04:47 GMT
Resuscitating this again...

So, I committed MAHOUT-163 (thanks, Shashi!) which implements Ted's log likelihood ideas and
I've been trying it out and also comparing it to what Carrot2 does for generating labels.
 One of the things that I think would make sense is to extend MAHOUT-163 to have the option
to return phrases instead of just terms.  My first thought is to just create an n-gram model
of the same field I'm clustering on (as that will allow the existing code to work unmodified),
but I wanted to hear what others think.  Is it worth the time?
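
Concretely, I'm picturing something like the sketch below (untested, the class
name is made up, and it assumes the pre-3.1 Lucene Analyzer API): wrap the
clustering field's analyzer with Lucene's ShingleFilter so that word n-grams
come through as ordinary tokens, and the MAHOUT-163 scoring doesn't have to
change.

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Illustrative only: emits unigrams plus 2..maxShingleSize word shingles on
// the same field that gets clustered, so phrases show up as regular terms.
public class ShingleAnalyzer extends Analyzer {

  private final int maxShingleSize;

  public ShingleAnalyzer(int maxShingleSize) {
    this.maxShingleSize = maxShingleSize;   // e.g. 2 or 3 word phrases
  }

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream stream = new StandardTokenizer(Version.LUCENE_CURRENT, reader);
    stream = new LowerCaseFilter(stream);
    return new ShingleFilter(stream, maxShingleSize);
  }
}

The obvious cost is the blow-up in vocabulary size, so a minimum document
frequency on the shingles would probably be needed in practice.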

I'm also interested in other approaches people have taken.

-Grant

On Sep 5, 2009, at 4:58 PM, Sebastien Bratieres wrote:

> Hi,
> 
> (I know this is an old topic -- but I am resuscitating it on purpose!)
> 
> I've come across this article (Lafferty & Blei 2009)
> http://www.citeulike.org/user/maximzhao/article/5084329 which seems to build
> upon Ted's log likelihood ratio. The goal is exactly the original poster's
> question: how to characterize a topic cluster with its terms.
> Ted, I'd be interested in knowing your opinion on this article; most
> importantly, how easily it can be implemented and what improvement it brings
> over LLR.
> 
> I hope this can help people on the list who are busy with topic clustering!
> 
> Sebastien
> 
> 
> 2009/8/12 Shashikant Kore <shashikant@gmail.com>
> 
>> I was referring to the condition where a phrase is identified as good
>> by LLR and is also a prominent feature of the centroid.  But, as you
>> clarified, the LLR score alone is a good indicator for top labels.
>> 
>> Thanks for the pointer to co-occurrence statistics. I will study some
>> literature on that.
>> 
>> --shashi
>> 
>> On Wed, Aug 12, 2009 at 11:23 PM, Ted Dunning<ted.dunning@gmail.com>
>> wrote:
>>> On Wed, Aug 12, 2009 at 6:12 AM, Shashikant Kore <shashikant@gmail.com
>>> wrote:
>>> 
>>>> 
>>>> Is this a necessary & sufficient  condition for a good cluster label?
>>> 
>>> 
>>> I am not entirely clear what "this" is.  My assertion is that a high LLR
>>> score is sufficient evidence to use the term or phrase.  I generally also
>>> limit the number of terms, taking only the highest scoring ones.  The
>>> phrase "necessary and sufficient" comes from a rigorous mathematical
>>> background that doesn't entirely apply here, where we are talking about
>>> heuristics like this.
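
In code, that scoring boils down to roughly the following sketch (not the
actual MAHOUT-163 patch; it assumes Mahout's
org.apache.mahout.math.stats.LogLikelihood utility and made-up count maps):

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.mahout.math.stats.LogLikelihood;

public class LlrLabeler {

  /**
   * termInCluster: document frequency of each term inside the cluster.
   * termInCorpus: document frequency of the same terms in the whole corpus
   * (assumed to contain every key of termInCluster).
   */
  public static List<String> topLabels(Map<String, Long> termInCluster,
                                       Map<String, Long> termInCorpus,
                                       long clusterDocs, long corpusDocs,
                                       int maxLabels) {
    final Map<String, Double> score = new HashMap<String, Double>();
    for (Map.Entry<String, Long> e : termInCluster.entrySet()) {
      long k11 = e.getValue();                        // in cluster, has term
      long k12 = termInCorpus.get(e.getKey()) - k11;  // outside cluster, has term
      long k21 = clusterDocs - k11;                   // in cluster, no term
      long k22 = corpusDocs - clusterDocs - k12;      // outside cluster, no term
      score.put(e.getKey(), LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22));
    }
    List<String> labels = new ArrayList<String>(score.keySet());
    Collections.sort(labels, new Comparator<String>() {
      public int compare(String a, String b) {
        return Double.compare(score.get(b), score.get(a));  // descending LLR
      }
    });
    return labels.subList(0, Math.min(maxLabels, labels.size()));
  }
}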
>>> 
>>> 
>>>> On a different note, is there any way to identify relationships among
>>>> the top labels of the clusters?  For example, if I have a cluster related
>>>> to automobiles, I may get the companies (GM, Ford, Toyota) along with
>>>> their popular models (Corolla, Cadillac) as top labels.  How can I
>>>> figure out that Toyota and Corolla are strongly related?
>>> 
>>> 
>>> Look at the co-occurrence statistics of the terms themselves.  Use that
>>> to form a sparse graph.  Then do spectral clustering or agglomerative
>>> clustering on the graph.
>>> 
>>> That will give you clusters of terms that will give you much of what you
>>> seek.  Of course, the fact that the terms are being used to describe the
>>> same cluster means that you have a good chance of just replicating the
>>> label sets for your clusters.
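
A crude way to act on that (names made up, not Mahout code): count how often
pairs of label terms co-occur in documents, drop weak pairs so the graph stays
sparse, then do single-link grouping, which for a fixed edge threshold is just
the connected components of the thresholded graph.

import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class LabelGrouper {

  /** docs: each document reduced to the set of cluster-label terms it contains. */
  public static Collection<Set<String>> group(List<Set<String>> docs,
                                              int minCooccurrence) {
    // 1. Sparse co-occurrence counts over unordered term pairs.
    Map<String, Map<String, Integer>> counts = new HashMap<String, Map<String, Integer>>();
    for (Set<String> doc : docs) {
      for (String a : doc) {
        for (String b : doc) {
          if (a.compareTo(b) < 0) {                  // count each pair once per doc
            Map<String, Integer> row = counts.get(a);
            if (row == null) {
              row = new HashMap<String, Integer>();
              counts.put(a, row);
            }
            Integer c = row.get(b);
            row.put(b, c == null ? 1 : c + 1);
          }
        }
      }
    }
    // 2. Single-link grouping: merge the groups of any pair with enough support.
    Map<String, Set<String>> groupOf = new HashMap<String, Set<String>>();
    for (Map.Entry<String, Map<String, Integer>> row : counts.entrySet()) {
      for (Map.Entry<String, Integer> cell : row.getValue().entrySet()) {
        if (cell.getValue() < minCooccurrence) {
          continue;                                  // too weak, keep the graph sparse
        }
        String a = row.getKey();
        String b = cell.getKey();
        Set<String> ga = groupOf.get(a);
        Set<String> gb = groupOf.get(b);
        if (ga == null && gb == null) {
          ga = new HashSet<String>(Arrays.asList(a, b));
        } else if (ga == null) {
          ga = gb;
          ga.add(a);
        } else if (gb == null) {
          ga.add(b);
        } else if (ga != gb) {
          ga.addAll(gb);                             // merge the two groups
          for (String t : gb) {
            groupOf.put(t, ga);
          }
        }
        groupOf.put(a, ga);
        groupOf.put(b, ga);
      }
    }
    return new HashSet<Set<String>>(groupOf.values());
  }
}

Weighting the edges by LLR instead of raw counts, or running a proper spectral
clustering over the same graph, would be the natural refinements.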
>>> 
>>> --
>>> Ted Dunning, CTO
>>> DeepDyve
>>> 
>> 


