mahout-user mailing list archives

From Ted Dunning
Subject Re: LLR Scoring question
Date Tue, 12 Jan 2010 19:20:35 GMT
Raw LLR is large whenever the observed counts deviate from what you would
expect, in either direction.  In this case, term2 is rare in the cluster and
common outside it, and is thus an anomaly as well.

One thing that I do is to use a variant of the LLR score:

    rootLLR = signum(k11/k1* - k21/k2*) * sqrt(LLR)

This score has two advantages over the basic LLR:

a) it is positive where k11 is bigger than expected, negative where it is
lower.  This resolves your current problem.

b) if there is no difference, it is asymptotically normally distributed.
This allows people to talk about "number of standard deviations" which is a
more common frame of reference than the chi^2 distribution.
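
As a rough illustration of the score above, here is a minimal Python sketch.
It assumes a 2x2 table where k11 = occurrences of the term in the cluster,
k12 = other terms in the cluster, k21 = occurrences of the term outside, and
k22 = other terms outside, so that k1* = k11 + k12 and k2* = k21 + k22.  The
function names are made up for this example, and the LLR is computed with the
usual entropy form of the G^2 statistic; treat it as a sketch, not as the
Mahout implementation.

```python
import math


def x_log_x(x):
    # x * ln(x), with the usual convention that 0 * ln(0) = 0
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized entropy: N ln N - sum(k ln k), where N = sum(counts)
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    # G^2 = 2 * (H(rows) + H(columns) - H(cells)); always >= 0
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off
    return max(0.0, 2.0 * (row + col - mat))


def root_llr(k11, k12, k21, k22):
    score = math.sqrt(llr(k11, k12, k21, k22))
    # Sign of (k11/k1* - k21/k2*), compared by cross-multiplication so that
    # no division is needed; equivalent when both row totals are positive.
    if k11 * (k21 + k22) < k21 * (k11 + k12):
        score = -score
    return score
```

With this, a term that is over-represented in the cluster and one that is
under-represented by the same amount get the same raw LLR, but root scores of
opposite sign, for example root_llr(50, 50, 10, 90) versus
root_llr(10, 90, 50, 50).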

On Tue, Jan 12, 2010 at 4:49 AM, Shashikant Kore wrote:

> As I can see Term1 is rarer outside the cluster, but common in the
> cluster (relatively speaking.) But, when I calculate LLR scores,
> Term1's score (3569) is lower than that of Term2 (3622). This looks
> counter-intuitive to me. Is it the case that LLR score is higher if
> term is common outside the cluster and rare inside?  Can this be
> "fixed"?

Ted Dunning, CTO
