If you are using rootLLR, then a threshold of 10 represents (roughly) 10
standard deviations. This is a big threshold.
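
To make the scale concrete, here is a minimal sketch of the 2x2 log-likelihood ratio and its signed square root, written in plain Python rather than taken from any particular library. The function names and the k11/k12/k21/k22 argument layout are my own assumptions for illustration. The raw LLR is asymptotically chi-squared with one degree of freedom, so its square root behaves roughly like a count of standard deviations. The helper chi2_tail_1dof gives the nominal tail probability for a raw LLR value.

```python
import math


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Raw log-likelihood ratio for a 2x2 contingency table.

    k11: both events together, k12: A without B,
    k21: B without A, k22: neither (hypothetical layout for this sketch).
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


def root_llr(k11, k12, k21, k22):
    """Signed square root of the LLR; negative when k11 is below expectation."""
    s = math.sqrt(llr(k11, k12, k21, k22))
    if k11 * (k21 + k22) < k21 * (k11 + k12):
        s = -s
    return s


def chi2_tail_1dof(x):
    """Nominal P(X > x) for a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))
```

Under these assumptions, a raw LLR of 10 corresponds to a root LLR of about 3.2 and a nominal tail probability between 0.001 and 0.002, while a root LLR of 10 corresponds to a raw LLR of 100.
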
What I generally do is set the threshold at a level where either a weak
majority of the resulting pairs are plausible (if I understand the domain)
or the result simply reaches the level of sparsity I want. Both approaches
seem to land in about the same place.
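
The sparsity-driven variant can be sketched as picking the cutoff from the scores themselves instead of fixing a number in advance. This is only an illustration: threshold_for_sparsity and its keep parameter are hypothetical names, and the scores are assumed to be LLR or rootLLR values for candidate pairs.

```python
import math


def threshold_for_sparsity(scores, keep):
    """Return a cutoff so that roughly `keep` of the scores are >= the cutoff.

    Ties at the cutoff value can let slightly more than `keep` scores through.
    """
    if keep <= 0:
        return math.inf          # keep nothing
    ranked = sorted(scores, reverse=True)
    if keep >= len(ranked):
        return -math.inf         # keep everything
    return ranked[keep - 1]
```

For example, with scores [1.0, 5.0, 3.0, 9.0, 7.0] and keep=2 the cutoff is 7.0, so only the pairs scoring 9.0 and 7.0 survive.
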
I also pretty much always downsample the number of items per user.
This has two motivations. One is simply pragmatics. The other is that it
decreases the influence of bots and other pathological users.
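
A minimal sketch of per-user downsampling, assuming the interactions arrive as (user, item) pairs and that a uniform random sample is acceptable. The function name, the max_items cap and the seed parameter are hypothetical choices for this example, not a description of any specific implementation.

```python
import random
from collections import defaultdict


def downsample_per_user(interactions, max_items, seed=0):
    """Cap the number of distinct items kept for each user.

    interactions: iterable of (user, item) pairs (assumed input shape).
    Users with more than max_items distinct items get a uniform random
    sample of max_items of them; everyone else is left untouched.
    """
    rng = random.Random(seed)    # fixed seed keeps the sample reproducible
    by_user = defaultdict(list)
    for user, item in interactions:
        by_user[user].append(item)

    sampled = {}
    for user, items in by_user.items():
        unique = list(dict.fromkeys(items))   # dedupe, preserving order
        if len(unique) > max_items:
            unique = rng.sample(unique, max_items)
        sampled[user] = unique
    return sampled
```

With a cap like this, a bot with thousands of interactions contributes no more cooccurrence pairs than an ordinary heavy user, and the work per user is bounded.
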
On Mon, Feb 11, 2013 at 2:57 AM, Johannes Schulte <
johannes.schulte@gmail.com> wrote:
>
>
> I am also thresholding the counts with LLR. Every time I do this I take a
> threshold of 10 since I loosely remember it being about the 99% margin of
> confidence in the chi square distribution. I got no clue however if anybody
> wants something like 99% for recommendations or if 50% might be a better
> value. What's your experience on that?
>
> And do you apply a limit on the total number of docs per term, since there
> could be big boolean queries tearing down the performance?
>
> Thanks for all the input!
