predictionio-user mailing list archives

From Pat Ferrel <>
Subject Re: Log-likelihood based correlation test?
Date Fri, 17 Nov 2017 18:52:30 GMT
Mahout builds the model by doing matrix multiplication (PtP), then calculating the LLR score
for every non-zero value. We then keep the top K or use a threshold to decide whether to keep
a value or not (both are supported in the UR). LLR is a metric for seeing how likely it is that
2 events in a large group are correlated. In the UR, therefore, LLR is only used to remove weak
data from the model.
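
As a rough illustration of the two steps above (score, then filter), here is a minimal Python
sketch. It assumes the entropy form of Ted Dunning's log-likelihood ratio on a 2x2 table of
counts, where k11 = both events, k12 = first only, k21 = second only, k22 = neither. The
function names and the top_k / min_llr arguments are made up for this example and are not the
Mahout or UR API.

```python
import math

def _x_log_x(x):
    # x * ln(x), with the usual convention that 0 * ln(0) = 0.
    return x * math.log(x) if x > 0 else 0.0

def _entropy(*counts):
    # Unnormalized entropy: xlogx(total) - sum of xlogx(count).
    return _x_log_x(sum(counts)) - sum(_x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table of counts."""
    row = _entropy(k11 + k12, k21 + k22)
    col = _entropy(k11 + k21, k12 + k22)
    mat = _entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating point round-off.
    return max(0.0, 2.0 * (row + col - mat))

def keep_correlators(row_scores, top_k=None, min_llr=None):
    """Filter one model row (item id -> LLR score) by threshold and/or top K.

    Only the surviving item ids are returned; the scores are dropped.
    """
    kept = [(score, item) for item, score in row_scores.items()
            if min_llr is None or score >= min_llr]
    kept.sort(key=lambda pair: (-pair[0], pair[1]))  # strongest first
    if top_k is not None:
        kept = kept[:top_k]
    return [item for _, item in kept]
```

A table with no association, such as llr(10, 10, 10, 10), scores about 0, while a perfectly
associated one, such as llr(100, 0, 0, 100), scores high, so a threshold or top-K cut keeps
only the strongly correlated pairs.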

So Mahout builds the model, which is then put into Elasticsearch, which is used as a KNN
(K-nearest neighbors) engine. The LLR score is not put into the model, only an indicator that
the item survived the LLR test.

The KNN is applied using the user’s history as the query and finding the items that most closely
match it. Since PtP will have items in rows and each row will hold the correlating items, this
“search” method works quite well to find items whose correlated purchases are very similar to
the items in the user’s history.
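
A toy stand-in for that query, in Python, may make the idea concrete. It assumes the model is
a dict from item id to the set of items that survived the LLR test, and it ranks by plain
overlap count. Lucene's real relevance scoring is different (term weighting, normalization),
so this only sketches the shape of the lookup; the names are hypothetical.

```python
def recommend(model, history, n=10):
    """Rank items by how many of their correlators appear in the user's history.

    model:   dict of item id -> set of correlated item ids (the LLR survivors)
    history: set of item ids the user has interacted with
    """
    scored = []
    for item, correlators in model.items():
        overlap = len(correlators & history)
        if overlap > 0:  # items with no matching correlator are not returned
            scored.append((overlap, item))
    # Highest overlap first; ties broken by item id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item for _, item in scored[:n]]
```

For example, with model = {"phone": {"case", "charger"}, "tent": {"stove"}} and a history of
{"case", "charger"}, only "phone" is returned.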

=============================== that is the simple explanation ========================================

Item-based recs take the item's entries in the model (its correlated items by the LLR test) as
the query, and the results are the most similar items, that is, the items with the most similar
correlating items.

The model is items in rows and items in columns if you are only using one event: PtP. If you
think it through, it has every purchased item as a row key, and the row holds the other items
purchased along with the row key. LLR filters out the weakly correlating non-zero values (0 means
no evidence of correlation anyway). If we didn’t do this it would be purely a “Cooccurrence”
recommender, one of the first useful ones. But filtering based on cooccurrence strength (PtP
values without LLR applied to them) produces much worse results than using LLR to filter for the
most highly correlated cooccurrences. You get a similar effect with Matrix Factorization but you
can only use one type of event for various reasons.

Since LLR is a probabilistic metric that only looks at counts, it can be applied equally well
to PtV (purchase, view), PtS (purchase, search terms), PtC (purchase, category-preferences).
We did an experiment using Mean Average Precision for the UR, comparing video “Likes” alone
against “Likes” plus “Dislikes” (so LtL vs. LtL and LtD), with data scraped from reviews, and got
a 20% lift in the MAP@k score by including data for “Dislikes”.
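
Because LLR only looks at counts, the same 2x2 table can be built for any pair of event types.
The Python sketch below counts users for one (purchased item, viewed item) pair. It assumes
each event type is held as a dict from user id to a set of item ids; that layout and the
function name are invented for the example and are not how Mahout stores the matrices.

```python
def contingency(primary, secondary, item_a, item_b):
    """Count users for the 2x2 table of (did A in primary, did B in secondary).

    primary, secondary: dict of user id -> set of item ids for that event type
    Returns (k11, k12, k21, k22):
      k11 = did both, k12 = A only, k21 = B only, k22 = neither.
    """
    users = set(primary) | set(secondary)
    k11 = k12 = k21 = k22 = 0
    for user in users:
        did_a = item_a in primary.get(user, set())
        did_b = item_b in secondary.get(user, set())
        if did_a and did_b:
            k11 += 1
        elif did_a:
            k12 += 1
        elif did_b:
            k21 += 1
        else:
            k22 += 1
    return k11, k12, k21, k22
```

Passing the purchase data as both arguments gives the PtP case; passing purchases and views
gives PtV. The four counts are what the LLR test takes as input.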

So the benefit and use of LLR is to filter weak data from the model and allow us to see if
dislikes, and other events, correlate with likes. Adding this type of data, which is usually
thrown away, is one of the most powerful reasons to use the algorithm. BTW the algorithm
is called Correlated Cross-Occurrence (CCO).

The benefit of using Lucene (at the heart of Elasticsearch) to do the KNN query is that it is
fast, taking the user’s realtime events into the query, but also that it is trivial to add all
sorts of business rules, like “give me recs based on user events but only ones from a certain
category”, or “give me recs but only ones tagged as ‘in-stock’”. In fact the business rules can
have inclusion rules, exclusion rules, and be mixed with ANDs and ORs.
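
To give a feel for how such rules combine with the history query, here is a hedged Python
sketch that builds an Elasticsearch bool query body as a plain dict. The bool clauses (should,
filter, must_not) and the term and ids queries are standard Elasticsearch query DSL, but the
field names "purchase", "category" and "tags" and the helper itself are assumptions for
illustration, not the UR's actual index schema or code.

```python
def build_query(history, category=None, required_tag=None, exclude_ids=()):
    """Build a bool query: match on user history, then apply business rules.

    history:      item ids from the user's events (one should clause per id)
    category:     if given, only return items in this category (inclusion rule)
    required_tag: if given, only return items carrying this tag (inclusion rule)
    exclude_ids:  document ids that must never be returned (exclusion rule)
    """
    bool_query = {
        # Each history item that appears in an item's correlator field adds to the match.
        "should": [{"term": {"purchase": item}} for item in history],
        "minimum_should_match": 1,
        "filter": [],    # ANDed conditions that do not affect scoring
        "must_not": [],  # exclusions
    }
    if category is not None:
        bool_query["filter"].append({"term": {"category": category}})
    if required_tag is not None:
        bool_query["filter"].append({"term": {"tags": required_tag}})
    if exclude_ids:
        bool_query["must_not"].append({"ids": {"values": list(exclude_ids)}})
    return {"query": {"bool": bool_query}}
```

Calling build_query(["i1", "i2"], category="electronics", required_tag="in-stock") yields a
body that matches on either history item but only returns in-stock electronics.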

BTW there is a version ready for testing with PIO 0.12.0 and ES5 here:
<> Instructions
in the readme and notice it is in the 0.7.0-SNAPSHOT branch.

On Nov 17, 2017, at 7:59 AM, Andrew Troemner <> wrote:

I'll echo Dan here. He and I went through the raw Mahout libraries called by the Universal
Recommender, and while Noelia's description is accurate for an intermediate step, the indexing
via ElasticSearch generates some separate relevancy scores based on their Lucene indexing
scheme. The raw LLR scores are used in building this process, but the final scores served
up by the APIs should be post-processed, and cannot be used to reconstruct the raw LLRs
(to my understanding).

There are also some additional steps including down-sampling, which scrubs out very rare combinations
(which otherwise would have very high LLRs for a single observation), which partially corrects
for the statistical problem of multiple detection. But the underlying logic is per Ted Dunning's
research and summarized by Noelia, and is a solid way to approach interaction effects for
tens of thousands of items and including secondary indicators (like demographics, or implicit

Associate Principal Data Scientist | <>
Office: 317.832.4404
Mobile: 317.531.0216

On Fri, Nov 17, 2017 at 9:55 AM, Daniel Gabrieli wrote:
Maybe someone can correct me if I am wrong but in the code I believe Elasticsearch is used
instead of "resulting LLR is what goes into the AB element in matrix PtP or PtL."

By default the strongest 50 LLR scores get set as searchable values in Elasticsearch per item-event

You can configure the thresholds for significance using the configuration parameters: maxCorrelatorsPerItem
or minLLR.  And this configuration is important because at default of 50 you may end up treating
all "indicator values" as significant.  More info here:

On Fri, Nov 17, 2017 at 4:50 AM, Noelia Osés Fernández wrote:

Let's see if I've understood how LLR is used in UR. Let P be the matrix for the primary conversion
indicator (say purchases) and Pt its transpose.

Then, with a second matrix, which can be P again to make PtP or a matrix for a secondary indicator
(say L for likes) to make PtL, we take a row from Pt (item A) and a column from the second
matrix (either P or L, in this example) (item B) and we calculate the table that Ted Dunning
explains on his webpage: the number of cooccurrences in which items A AND B have both been
purchased (or purchased AND liked), the number of times that item A OR B have been purchased (or
purchased OR liked), and the number of times that neither item A nor B have been purchased (or
purchased or liked). With these counts we calculate LLR following the formulas that Ted Dunning provides
and the resulting LLR is what goes into the AB element in matrix PtP or PtL. Correct?   

Thank you!

On 16 November 2017 at 17:03, Noelia Osés Fernández wrote:
Wonderful! Thanks Daniel!

Suneel, I'm still new to the Apache ecosystem and so I know that Mahout is used but only vaguely...
I still don't know the different parts well enough to have a good understanding of what each
of them does (Spark, MLLib, PIO, Mahout,...)

Thank you both!

On 16 November 2017 at 16:59, Suneel Marthi wrote:
Indeed so. Ted Dunning is an Apache Mahout PMC and committer and the whole idea of Search-based
Recommenders stems from his work and insights.  If u didn't know, the PIO UR uses Apache Mahout
under the hood and hence u see the LLR.

On Thu, Nov 16, 2017 at 3:49 PM, Daniel Gabrieli wrote:
I am pretty sure the LLR stuff in UR is based off of this blog post and associated paper: <>

Accurate Methods for the Statistics of Surprise and Coincidence
by Ted Dunning <>

On Thu, Nov 16, 2017 at 10:26 AM, Noelia Osés Fernández wrote:

I've been trying to understand how the UR algorithm works and I think I have a general idea.
But I would like to have a mathematical description of the step in which the LLR comes into
play. In the CCO presentations I have found it says:

(PtP) compares column to column using log-likelihood based correlation test

However, I have searched for "log-likelihood based correlation test" in Google but no joy.
All I get are explanations of the likelihood-ratio test to compare two models. 

I would very much appreciate a math explanation of log-likelihood based correlation test.
Any pointers to papers or any other literature that explains this specifically are much appreciated.

Best regards,
