mahout-dev mailing list archives

From "Ted Dunning" <ted.dunn...@gmail.com>
Subject Re: Taste on Mahout
Date Wed, 21 May 2008 21:00:39 GMT
LLR stands for log likelihood ratio (= -2 log lambda).

The table that I gave in the previous email is a bit easier to get to from
the numbers you usually have in practice.

I have some Java code handy that will help implement this function.  It
would require a bit of API coercion to use the right kind of matrix, but
otherwise should work.   I should file a Jira and then post a patch.
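
Roughly, a self-contained sketch of that function (not an actual patch; the class and
method names below are invented purely for illustration) could compute -2 log lambda
for an arbitrary count matrix using the entropy form quoted further down in this thread:

    public final class LogLikelihoodSketch {

      /** Shannon entropy (in nats) of a set of counts; zero counts contribute nothing. */
      private static double entropy(double... counts) {
        double total = 0.0;
        for (double c : counts) {
          total += c;
        }
        double h = 0.0;
        for (double c : counts) {
          if (c > 0.0) {               // plays the role of logSafe: 0 * log 0 -> 0
            double p = c / total;
            h -= p * Math.log(p);
          }
        }
        return h;
      }

      /** -2 log lambda = 2 * N * ( H(rowSums) + H(colSums) - H(all cells) ). */
      public static double logLikelihoodRatio(double[][] k) {
        int rows = k.length;
        int cols = k[0].length;
        double[] rowSums = new double[rows];
        double[] colSums = new double[cols];
        double[] cells = new double[rows * cols];
        double n = 0.0;
        for (int i = 0; i < rows; i++) {
          for (int j = 0; j < cols; j++) {
            rowSums[i] += k[i][j];
            colSums[j] += k[i][j];
            cells[i * cols + j] = k[i][j];
            n += k[i][j];
          }
        }
        return 2.0 * n * (entropy(rowSums) + entropy(colSums) - entropy(cells));
      }
    }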

The chi-squared value is non-negative, as you say.  Taking the square root
and adding a sign turns the 1 degree of freedom (aka 2x2 table, aka
cooccurrence) version into an asymptotically normally distributed value if
the events are independent.  That makes interpretation easier for lots of
people since you can talk about standard deviations.
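
For the 2x2 case, the signed, asymptotically normal version could be a thin wrapper
on the sketch above (again just illustrative; the sign test is the
signum( k_ab / k_b - k_a / N ) comparison quoted later in the thread):

      // Could live in the same hypothetical class as the sketch above.
      // Signed square root of -2 log lambda for a 2x2 table; negative when A and B
      // cooccur less than independence would predict.
      public static double signedRootLogLikelihoodRatio(double k11, double k12,
                                                         double k21, double k22) {
        double root = Math.sqrt(LogLikelihoodSketch.logLikelihoodRatio(
            new double[][] {{k11, k12}, {k21, k22}}));
        double n = k11 + k12 + k21 + k22;
        // Row 1 is "A present", column 1 is "B present", so this is k_ab/k_b vs k_a/N.
        if (k11 / (k11 + k21) < (k11 + k12) / n) {
          root = -root;
        }
        return root;
      }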

The formula in section 8 is essentially the same as what I gave earlier, but
it is limited to the 2x2 case.

The confusion about notation is endemic with applications of this test because it
can be applied to fairly different kinds of problems.

If you are doing feature selection, then usually you have two classes of
training examples.  Let's call these classes 1 and 2.  These classes might
be "relevant" and "non-relevant" for document classification or "appears
after A" and "appears after something not A" for testing which words coocur
with A.  We also have a feature of interest.  For document classification,
that feature might be "word w occurred at least once".  For cooccurrence
testing, that feature might be "word B appears".  In any case, we should be
able to get counts k_1 for the number of times the feature occurred in class
1 and n_1 for the number of examples in class 1.  Similarly, k_2 and n_2 are
the number of occurrences of the feature in class 2 and the number of
examples in class 2.  The table from my previous example would then be:

                     feature present    feature absent
          class 1        k_1               n_1 - k_1
          class 2        k_2               n_2 - k_2

and the implementation would proceed as before.  It actually is useful to
handle count matrices of arbitrary size and shape because we might have
classes 1, 2 and 3, or our feature might have more than 2 values.  An
example of 3 classes might be "judged relevant", "judged non-relevant" and
"unjudged" for the document classification problem.  An example of a trinary
feature might be "word occurred > 1", "word occurred once" and "word didn't
occur".  The issue comes up often enough to make it important to implement
the general case.
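
Purely as an illustration (all of the counts below are made up), calling the
hypothetical routine above on the 2x2 table built from k_1, n_1, k_2 and n_2, and on
a 3x3 table for the three-class, trinary-feature example, would look like:

    // Illustrative only; the counts are invented for the sake of the example.
    double k1 = 100, n1 = 1000;   // feature occurrences / examples in class 1
    double k2 = 20,  n2 = 2000;   // feature occurrences / examples in class 2

    double llr2x2 = LogLikelihoodSketch.logLikelihoodRatio(new double[][] {
        {k1, n1 - k1},
        {k2, n2 - k2}
    });

    // Three classes ("judged relevant", "judged non-relevant", "unjudged") crossed
    // with a trinary feature (occurred > 1, occurred once, didn't occur).
    double llr3x3 = LogLikelihoodSketch.logLikelihoodRatio(new double[][] {
        {30, 50,  920},
        { 5, 15, 1980},
        {40, 60, 4900}
    });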

Hope this helps.

On Wed, May 21, 2008 at 12:00 PM, Sean Owen <srowen@gmail.com> wrote:

> Thanks, I admit this is pushing the limits of my knowledge of stats.
> Maybe you can help me offline here. What's the LLR function, and how
> does this table of four values relate to k1 and k2 in the formula in
> section 8? Is that formula OK to implement, since I think I do get
> that one.
>
> The chi-squared stat is a positive value, right? Then yes, to map it into
> [-1,1] and make it correlation-like I'd need a simple transformation or
> two like what you describe.
>
> On Wed, May 21, 2008 at 2:49 PM, Ted Dunning <ted.dunning@gmail.com>
> wrote:
> > Yes.  If you have things A and B and counts k_ab for when A and B occur
> > together, k_a for when A appears with or without B, k_b similar to k_a but
> > for B, and N which is the total observations, then you build the table:
> >
> >       k_ab             k_a - k_ab
> >       k_b - k_ab       N - k_a - k_b + k_ab
> >
> > and then you can apply the LLR function directly.
> >
> > It is often helpful to report sqrt(-2 log lambda) with a sign according to
> > whether k_ab is greater or less than expected, i.e.
> >
> >      signum( k_ab / k_b - k_a / N ) * sqrt(- 2 log lambda)
> >
> > Also, for computation of - 2 log lambda, it is usually easier to compute it
> > in terms of mutual information, which is in turn expressed in terms of
> > entropy:
> >
> >    - 2 log lambda = 2 * N * MI = 2 * N * ( H(rowSums(K)) + H(colSums(K)) - H(K) )
> >
> >    H(X) = - sum (X / sum(X)) logSafe (X / sum(X))
> >
> >    logSafe(x) = log(x + (x == 0))
> >
> > The resulting score is not directly suitable as a distance measure, but is
> > very handy for masking co-occurrence matrices.
> >
> > On Wed, May 21, 2008 at 11:22 AM, Sean Owen <srowen@gmail.com> wrote:
> >
> >> Got it, so do I have it right that you suggest defining the
> >> "correlation" as really the chi-squared statistic, the -2 log lamba
> >> formula? k1 is the number of items 'preferred' by user 1, and n1 is
> >> the total number of items in the universe, and likewise for k2/n2? so
> >> n1 == n2?
> >>
> >> Simple is cool with me, to start. This sounds more sophisticated than
> >> a simple intersection/union approach. I could add that algorithm too
> >> for kicks.
> >>
> >> On Wed, May 21, 2008 at 12:58 PM, Ted Dunning <ted.dunning@gmail.com>
> >> wrote:
> >> > Correlation (per se) between such sparse binary vectors can be very
> >> > problematic.
> >> >
> >> > This is a general problem with this kind of data and really needs to be
> >> > handled directly.  Not clicking on an item is much less informative than
> >> > clicking on an item (so little time, so much to click).  Any system you
> >> > build has to deal with that and with coincidence.  For instance, raw
> >> > correlation gives 100% match for two people who happen to have clicked on
> >> > the same single item.  If that item is a very popular one, however, this
> >> > is not a very interesting fact.
> >> >
> >> > One very simple way of dealing with this was described in
> >> > http://citeseer.ist.psu.edu/29096.html .  Since then, I have found other,
> >> > more comprehensive techniques but they are considerably more complex.
> >>
> >
> >
> >
> > --
> > ted
> >
>



-- 
ted
