Ok. so here's the recap of what i have gathered (also illustrates source of
my initial confusion).
(1) The original test assumes that there are two Bernoulli random
variables, A and B, and their draws are conducted at the same time. Hence,
the logic goes, the LLR statistics will asymptotically approximate
chisquared with df = 1. Another point here is that for discrete events
information measures baked in in the statistics work better _to compare_
things, than normality assumptions of the chisquared distributions, for
the few datasets studied. The hope is that this study can extrapolate to
other data sets.
Note that i have found no accuracy study done that we can use chisquared
distribution to derive pvalues and makes confidence quanification
statements as in "we have 95% confidence that was not coincidence in
behavior of the subject". Which is in part, perhaps, due to approximation,
and in part perhaps because of assumptions of the problems i discuss below.
(2) With Surprise & concidence paper, the study goes over evaluating what i
call for now a "bigram problem", which is signficantly different setting
from the CF setting in cooccurrence paper.
With bigram problem, as far as i gathered from numbers, we collect N
bigrams. N could be thought of as number of edges in directed labeled graph
connecting tokenrepresenting nodes, with labels representing counts.
Obviously this graph is then can be represented as a sparse cooccurrence
matrix. Under this setting, "A&B" represents matrix element, "withA" and
"withB" represents correspondent element of colSums(), and the total trial
count (N) represents matrix elementwise sum (aka 1st pnorm). This problem
formulation for cooccurrence matrix, as we will see, is signficantly
different from the CF cooccurrence construction.
Another thing is that with bigram problem we encounter stretch of sampling
assumptions. In order to comply with assumptions made in (1), and keeping
in mind that 1bigram = 1draw, we have to assume we do a bernoulli draw for
every term with every bigram. But actual data tells us that only exactly
two variables (A and B) can draw as true at every draw; i.e. we just do
one hypergeometrical (or multinomial, if you are willing to consider
selfcooccurrences) with k=2, and set the rest of the term set as being
drawn as 0s. Then, again, we can demonstrate that it works better to
_compare_ things still (or do topn), since assumption stretch equally
applies to all, but somehow it still feels like we cannot reliably
demostrate accuracy of confidence measurements of bigrams, and, again, it
would seem no accuracy study on a concrete dataset for confidence measure
was actually attempted.
(3) The Sebastien's paper describes another kind of problem, which i called
"CF cooccurrence" to oppose the "bigram" case. In here, we test each item
independently using individual user choice as a trial. Hence, number of
trials (N) = number of user per item, and withA and withB are taken as
elements of colsums of user/item matrix (not cooccurrence matrix). makes
sense.
Relaxation of assumption here is as follows. We obviously want to measure
unbiased trial behavor for items, since when we present them to user as
recommendation, we will solely rely on item true value given fairly
guaranteed portion of user attention. However, by the way of sampling and
construction of building our trials dataset, we experience, as they say
it, a _biased sampling_ for reasons mentioned above (UI does not guarantee
equal portions of user attention per item, so significant portion of
assumed negative trials simply did not occur).
In extreme case, when we just start the system, we have 0 exploration
happened, and thus our degenerate experiments represent an extremely biased
trials data. As we move towards more explored landscape, our bias
(hopefully) dissappear. However, it is not quite clear what time it would
take, and whether this time would be shorter than the stock time of the
item in many cases. So, in most realistic cases the trials sampling bias
will persist. Among other things, this also significantly relaxes accuracy
of any sort of confidence statement along the lines "we are 95% confident A
and B cooccurred, if any, due to chance".
(4) Now, about my problem.
My problem is more like "bigram" problem rather than "CF" problem. I am
constructing cooccurrence based on labeled graph and then evaluate the
chances of nonindependence of cooccurrences. As such, i never have A'A
operation (and no use of users as equivalent of trials).
Comparing scores and inferring top N is of course still interesting for me.
However, perhaps more importantly, i was aiming to evaluate confidence in
more or less chisquared pvalue terms and make statements along the lines,
"given biascompensated nature of trials, we can express 95% confidence of
noncoincidence of events A and B".
I do compare things as well, but also i expect a great amount of
nonconfident results (due to the way data collected), and being able to
throw away a great deal of lowconfidence cooccurrences. Maybe even 99%
cooccurrences (the way experiments are conducted, we can parametrise them
to manipulate true cooccurrence recall/precision, but we need to be able
to tell false cooccurrences more or less confidently in order to make a
good judgement about experiment parameters). Changing parameters of
experiment is obviously changing distributions of trials it produces, but
under this particular problem it is o.k. for various reasons as it also
translates into the distributions about which prediction is made.
On Mon, Aug 11, 2014 at 5:22 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> Opportunity in this case does not refer to "Saw something and decided to
> buy/click/view/listen". Instead, it refers to the entire process including
> the UI and whatever discovery mechanisms exist.
>
> You are correct that it is not just measuring the human component.
>
>
>
> On Mon, Aug 11, 2014 at 4:56 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> wrote:
>
> > this logic does assume though that every user had an opportunity to
> eyeball
> > (A,B) pair and chose (vastly) not to buy them whereas it may have
> > overwhelmingly happened due to lack of exploration. thus, (not A & not B)
> > figures are probably grossly overestimated
> >
> >
> > On Mon, Aug 11, 2014 at 4:53 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> > wrote:
> >
> > > No, the question was why total number of trials sent to LLR is
> considered
> > > to be m where M is m x n is a user/item matrix.
> > >
> > > ok i got it. i made some incorrect assumption about previous code
> steps,
> > > hence my inference derailed.
> > >
> > >
> > >
> > >
> > > On Mon, Aug 11, 2014 at 4:07 PM, Ted Dunning <ted.dunning@gmail.com>
> > > wrote:
> > >
> > >> If you binarize the original occurrence matrix then it seems to me
> that
> > >> the
> > >> values in the cooccurrence matrix *are* user counts.
> > >>
> > >> Perhaps I misunderstand your original question.
> > >>
> > >>
> > >>
> > >> On Mon, Aug 11, 2014 at 3:44 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> > >> wrote:
> > >>
> > >> > sorry rather total occurrences of a pair should be sum(a_i) +
> > sum(a_j) 
> > >> > a_ij (not 1norm of course)
> > >> >
> > >> >
> > >> > On Mon, Aug 11, 2014 at 3:11 PM, Dmitriy Lyubimov <
> dlieu.7@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > Why coocurrence code takes number of users as total interactions?
> > >> > > shouldn't that be 1norm of the cooccurrence matrix?
> > >> > >
> > >> > >
> > >> > > On Wed, Aug 6, 2014 at 4:21 PM, Dmitriy Lyubimov <
> dlieu.7@gmail.com
> > >
> > >> > > wrote:
> > >> > >
> > >> > >> So, compared to original paper [1], similarity is now hardcoded
> and
> > >> > >> always LLR? Do we have any plans to parameterize that further?
Is
> > >> there
> > >> > any
> > >> > >> reason to parameterize it?
> > >> > >>
> > >> > >>
> > >> > >> Also, reading the paper, i am a bit wondering  similarity
and
> > >> distance
> > >> > >> are functions that usually are moving into different directions
> > (i.e.
> > >> > >> cosine similarity and angular distance) but in the paper
distance
> > >> scores
> > >> > >> are also considered similarities? How's that?
> > >> > >>
> > >> > >> I suppose in that context LLR is considered a distance (higher
> > scores
> > >> > >> mean more `distant` items, cooccurring by chance only)?
> > >> > >>
> > >> > >> [1] http://ssc.io/wpcontent/uploads/2012/06/rec11schelter.pdf
> > >> > >>
> > >> > >> d
> > >> > >>
> > >> > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>
