Right, makes sense. So, by normalize, I need to replace the counts in the
matrix with probabilities.
So, I would divide everything by the sum of all the counts in the matrix?
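
To make that concrete, here is a minimal sketch in Python, assuming the entropy-based form of the statistic described in [1]. The names `x_log_x`, `entropy`, `llr` and `chi2_cdf_1df` are made up for illustration and are not the project's API. It only shows how the value computed from raw counts relates to the one computed from normalized probabilities, and how a chi-squared CDF with 1 degree of freedom could be evaluated.

```python
import math


def x_log_x(x):
    # Convention: 0 * log(0) is treated as 0.
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*values):
    # Unnormalized entropy: S*log(S) - sum(v*log(v)), where S = sum(values).
    # For counts this equals S times the Shannon entropy of values / S.
    return x_log_x(sum(values)) - sum(x_log_x(v) for v in values)


def llr(k11, k12, k21, k22):
    # Test statistic as a difference of entropies for a 2x2 table.
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


def chi2_cdf_1df(x):
    # CDF of the chi-squared distribution with 1 degree of freedom.
    return math.erf(math.sqrt(x / 2.0))


counts = (10, 5, 3, 20)  # hypothetical 2x2 contingency table
n = sum(counts)
probs = [k / n for k in counts]  # divide every cell by the total count

raw = llr(*counts)         # statistic from raw counts
normalized = llr(*probs)   # same formula on probabilities
scaled = llr(*(10 * k for k in counts))

print(raw, normalized * n, scaled / 10)  # all three agree up to rounding
print(chi2_cdf_1df(raw))
```

In this sketch the value from raw counts is the value from probabilities multiplied by the total count, which is the factor Sean mentions below.
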
On Thu, Jun 20, 2013 at 12:16 PM, Sean Owen <srowen@gmail.com> wrote:
> I think the quickest answer is: the formula computes the test
> statistic as a difference of log values, rather than log of ratio of
> values. By not normalizing, the entropy is multiplied by a factor (sum
> of the counts) vs normalized. So you do end up with a statistic N
> times larger when counts are N times larger.
>
> On Thu, Jun 20, 2013 at 9:52 AM, Dan Filimon
> <dangeorge.filimon@gmail.com> wrote:
> > My understanding:
> >
> > Yes, the log-likelihood ratio (-2 log lambda) follows a chi-squared
> > distribution with 1 degree of freedom in the 2x2 table case.
> >        A    ~A
> >   B
> >  ~B
> >
> > We're testing to see if p(A | B) = p(A | ~B). That's the null hypothesis. I
> > compute the LLR. The larger that is, the more unlikely the null hypothesis
> > is to be true.
> > I can then look at a table with df=1. And I'd get p, the probability of
> > seeing that result or something worse (the upper tail).
> > So, the probability of them being similar is 1 - p (which is exactly the
> > CDF for that value of X).
> >
> > Now, my question is: in the contingency table case, why would I normalize?
> > It's a ratio already, isn't it?
> >
> >
> > On Thu, Jun 20, 2013 at 11:03 AM, Sean Owen <srowen@gmail.com> wrote:
> >
> >> Someone can check my facts here, but the log-likelihood ratio follows
> >> a chi-square distribution. You can figure an actual probability from
> >> that in the usual way, from its CDF. You would need to tweak the code
> >> you see in the project to compute an actual LLR by normalizing the
> >> input.
> >>
> >> You could use 1 - p then as a similarity metric.
> >>
> >> This also isn't how the test statistic is turned into a similarity
> >> metric in the project now. But 1 - p sounds nicer. Maybe the historical
> >> reason was speed, or, ignorance.
> >>
> >> On Thu, Jun 20, 2013 at 8:53 AM, Dan Filimon
> >> <dangeorge.filimon@gmail.com> wrote:
> >> > When computing item-item similarity using the log-likelihood similarity
> >> > [1], can I simply apply a sigmoid to the resulting values to get the
> >> > probability that two items are similar?
> >> >
> >> > Is there any other processing I need to do?
> >> >
> >> > Thanks!
> >> >
> >> > [1] http://tdunning.blogspot.ro/2008/03/surpriseandcoincidence.html
> >>
>
