mahout-user mailing list archives

From Shashikant Kore <shashik...@gmail.com>
Subject Re: LLR Scoring question
Date Wed, 13 Jan 2010 17:00:05 GMT
Ted,

Thank you for the tip.

>
>    rootLLR = signum(k11/k1* - k21/k2*) * sqrt(LLR)
>

I didn't understand what k1* and k2* are. I used the row sums (k11+k12)
and (k21+k22) as the denominators, and that gives the correct result.
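
For reference, here is a minimal Python sketch of the calculation under that
reading, i.e. assuming k1* = k11+k12 and k2* = k21+k22. The function names
(x_log_x, entropy, llr, root_llr) are only illustrative and are not taken from
any library. LLR here is the usual G^2 statistic for a 2x2 table, written in
its entropy form.

```python
import math


def x_log_x(x):
    # x * ln(x), with the convention that 0 * ln(0) = 0
    return x * math.log(x) if x > 0 else 0.0


def entropy(*counts):
    # Unnormalized entropy: N ln N - sum(k ln k), where N = sum(counts)
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    # G^2 log-likelihood ratio for the 2x2 contingency table
    #   [[k11, k12],
    #    [k21, k22]]
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp at zero to absorb small negative round-off errors
    return max(0.0, 2.0 * (row + col - mat))


def root_llr(k11, k12, k21, k22):
    # Assumption: k1* and k2* are the row sums k11+k12 and k21+k22.
    # The sign of k11/k1* - k21/k2* is obtained by cross-multiplying,
    # which avoids dividing by zero when a row is empty.
    diff = k11 * (k21 + k22) - k21 * (k11 + k12)
    sign = (diff > 0) - (diff < 0)
    return sign * math.sqrt(llr(k11, k12, k21, k22))
```

With this, a term that is over-represented in the first row (for example
root_llr(100, 900, 10, 990)) gets a positive score, and swapping the two rows
gives the same magnitude with a negative sign.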

--shashi

On Wed, Jan 13, 2010 at 12:50 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
> Raw LLR has a large value whenever there is an anomaly.  In this case, term2
> is rare in the cluster and common outside and is thus an anomaly.
>
> One thing that I do is to use a variant of the LLR score:
>
>    rootLLR = signum(k11/k1* - k21/k2*) * sqrt(LLR)
>


> This score has two advantages over the basic LLR:
>
> a) it is positive where k11 is bigger than expected, negative where it is
> lower.  This resolves your current problem.
>
> b) if there is no difference it is asymptotically normally distributed.
> This allows people to talk about "number of standard deviations" which is a
> more common frame of reference than the chi^2 distribution.
>
>
> On Tue, Jan 12, 2010 at 4:49 AM, Shashikant Kore <shashikant@gmail.com> wrote:
>
>> As far as I can see, Term1 is rare outside the cluster but relatively
>> common inside it. Yet when I calculate LLR scores, Term1's score (3569)
>> is lower than Term2's (3622). This looks counter-intuitive to me. Is the
>> LLR score also high when a term is common outside the cluster and rare
>> inside? Can this be "fixed"?
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>
