mahout-user mailing list archives

From Shashikant Kore
Subject Re: Methods for Naming Clusters
Date Tue, 11 Aug 2009 17:54:20 GMT
On Tue, Aug 11, 2009 at 8:57 PM, Ted Dunning wrote:
> If you expand the LLR equation and look at which terms are big, you will see
> k_11 * log(mumble)  as an important term for many words.  Usually, this is
> about the same as tf.idf since mumble is about the same as the term
> frequency.  For a single document, tf.idf is a very close approximation of
> LLR.  With many documents, the situation can change dramatically, however,
> and you can get cancellation effects that eliminate documents that would
> otherwise have high tf.idf.  These are generally the terms that lead to
> over-fitting with methods like naive bayes and are often not such great
> cluster descriptors.
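
For reference, the expansion mentioned above can be sketched in a few lines. This is only an illustration and assumes the standard G-test form of the log-likelihood ratio over a 2x2 contingency table; the function name llr and the reading of the cells (k11 = occurrences of the term inside the cluster, k12 = occurrences of the term elsewhere, k21 = other terms inside the cluster, k22 = other terms elsewhere) are assumptions for the example, not something stated in the thread. In this form the k_11 term is k11 * log(k11 * N / (row1 * col1)), which is the "k_11 * log(mumble)" piece.

```python
import math


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    Assumed cell meanings (hypothetical, for illustration):
      k11: term occurrences inside the cluster
      k12: term occurrences outside the cluster
      k21: other-term occurrences inside the cluster
      k22: other-term occurrences outside the cluster
    """
    n = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22

    # Each cell contributes k * log(observed / expected),
    # where expected = row_total * col_total / n.
    cells = (
        (k11, row1, col1),
        (k12, row1, col2),
        (k21, row2, col1),
        (k22, row2, col2),
    )
    total = 0.0
    for k, r, c in cells:
        if k > 0:  # a zero cell contributes nothing (limit of k*log(k) is 0)
            total += k * math.log(k * n / (r * c))
    return 2.0 * total
```

A table with no association (all four cells equal) scores zero, while a term that occurs only inside the cluster scores high. The terms for k12, k21 and k22 are where the cancellation effects can come from once many documents are involved.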

Let me restate what I understood.

If a phrase is identified as a prominent phrase by LLR and it also
happens to be the top-weighted feature in the centroid vector, it is
not a good candidate for a cluster label.

Is this correct?

