Subject: Re: Log-likelihood based correlation test?
From: Noelia Osés Fernández <noses@vicomtech.org>
Date: Tue, 21 Nov 2017 10:28:52 +0100
To: user@predictionio.apache.org
Cc: Andrew Troemner <atroemner@salesforce.com>, actionml-user <actionml-user@googlegroups.com>
Pat,

If I understood your explanation correctly, you say that some elements of PtP are removed by the LLR (set to zero, to be precise). But the elements that survive are calculated by matrix multiplication. The final PtP is put into Elasticsearch, and when we query for user recommendations ES uses KNN to find the items (the rows in PtP) that are most similar to the user's history.

If the non-zero elements of PtP have been calculated by straight matrix multiplication, and I'm assuming that the P matrix only has 0s and 1s to indicate which items have been purchased by which user, then the elements of PtP are either 0 or greater than or equal to 1. However, the scores I get are below 1.
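To make the arithmetic concrete, a minimal sketch (plain Scala, made-up 3x3 binary P; no Mahout involved):

    // Made-up example: 3 users x 3 items, p(u)(i) = 1 if user u purchased item i.
    val p = Array(
      Array(1, 1, 0),
      Array(1, 0, 1),
      Array(1, 1, 0))

    // (PtP)(a)(b) = number of users who purchased both item a and item b:
    // always a non-negative integer, never a fraction strictly between 0 and 1.
    val ptp = Array.tabulate(3, 3) { (a, b) =>
      (0 until 3).map(u => p(u)(a) * p(u)(b)).sum
    }
    ptp.foreach(row => println(row.mkString(" ")))  // first row: 3 2 1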

So is the KNN using cosine similarity as a metric to calculate the closest neighbours? And are the results of this cosine similarity metric what is returned as a 'score'?

If it is, when the score is greater than 1, is this because the different cosine similarities are added together, i.e. PtP, PtL...?

Thank you for all your valuable help!

On 17 November 2017 at 19:52, Pat Ferrel <pat@occamsmachete.com> wrote:
Mahout builds the model by doing matrix multiplication (PtP), then calculating the LLR score for every non-zero value. We then keep the top K or use a threshold to decide whether to keep or not (both are supported in the UR). LLR is a metric for seeing how likely 2 events in a large group are correlated. Therefore LLR is only used to remove weak data from the model.
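For the curious, a self-contained sketch of that per-cell test. It mirrors the entropy form of Mahout's LogLikelihood.logLikelihoodRatio; the code here is illustrative, not the shipped implementation:

    // Illustrative LLR for one 2x2 cooccurrence table.
    object LlrSketch {
      private def xLogX(x: Long): Double = if (x == 0L) 0.0 else x * math.log(x)

      // Unnormalized Shannon entropy of a list of counts.
      private def entropy(counts: Long*): Double =
        xLogX(counts.sum) - counts.map(xLogX).sum

      /** k11 = users who did A and B, k12 = A only, k21 = B only, k22 = neither. */
      def logLikelihoodRatio(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
        val rowEntropy    = entropy(k11 + k12, k21 + k22)
        val columnEntropy = entropy(k11 + k21, k12 + k22)
        val matrixEntropy = entropy(k11, k12, k21, k22)
        math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy))
      }
    }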

So Mahout builds the model, then it is put into Elasticsearch, which is used as a KNN (K-nearest neighbors) engine. The LLR score is not put into the model, only an indicator that the item survived the LLR test.

The KNN is applied using the user's history as the query, finding the items that most closely match it. Since PtP will have items in rows and each row will hold that item's correlating items, this "search" method works quite well to find items that had very similar items purchased with them, as are in the user's history.
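Concretely, the shape is roughly as below. This is a sketch: the field and item names are made up for illustration, not the UR's actual schema.

    // Assumed document shape: one ES doc per item; each indicator field holds
    // the ids of items that survived the LLR test for that item.
    val itemDoc =
      """{ "id": "item-42", "purchase": ["item-7", "item-13"], "view": ["item-7"] }"""

    // Assumed query shape: the user's recent history, one terms clause per
    // event type; Lucene's relevance scoring then ranks the matching items.
    val knnQuery =
      """{ "query": { "bool": { "should": [
        |  { "terms": { "purchase": ["item-7", "item-99"] } },
        |  { "terms": { "view":     ["item-7"] } } ] } } }""".stripMargin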

================================ that is the simple explanation ================================

Item-based recs take the model items (correlated items by the LLR test) as the query, and the results are the most similar items: the items with the most similar correlating items.

The model is items in rows and items in columns if you are only using one event: PtP. If you think it through, it is all purchased items as the row key, with the other items purchased along with the row key. LLR filters out the weakly correlating non-zero values (0 means no evidence of correlation anyway). If we didn't do this it would be purely a "Cooccurrence" recommender, one of the first useful ones. But filtering based on cooccurrence strength (PtP values without LLR applied to them) produces much worse results than using LLR to filter for the most highly correlated cooccurrences. You get a similar effect with Matrix Factorization, but you can only use one type of event for various reasons.
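A sketch of that filtering step, reusing the LlrSketch above (the real code runs on distributed Mahout matrices; names and the default of 50 are illustrative):

    // For one row item, score every cooccurring item by LLR and keep the K
    // strongest; the LLR scores themselves are then thrown away.
    def strongestCorrelators(
        rowItem: String,
        cooc: Map[String, Long],        // other item -> users who did both
        itemCounts: Map[String, Long],  // item -> users who did it at all
        numUsers: Long,
        k: Int = 50): Seq[String] = {
      val nA = itemCounts(rowItem)
      cooc.toSeq
        .map { case (other, k11) =>
          val nB  = itemCounts(other)
          val k12 = nA - k11                  // did A but not B
          val k21 = nB - k11                  // did B but not A
          val k22 = numUsers - nA - nB + k11  // did neither
          other -> LlrSketch.logLikelihoodRatio(k11, k12, k21, k22)
        }
        .sortBy(-_._2)
        .take(k)
        .map(_._1)
    }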

Since LLR is a probabilistic metric that only looks at counts, it can be applied equally well to PtV (purchase, view), PtS (purchase, search terms), PtC (purchase, category-preferences). We did an experiment using Mean Average Precision for the UR using video "Likes" vs "Likes" and "Dislikes", so LtL vs. LtL and LtD, scraped from rottentomatoes.com reviews, and got a 20% lift in the MAP@k score by including data for "Dislikes". https://developer.ibm.com/dwblog/2017/mahout-spark-correlated-cross-occurences/

So the benefit and use of LLR is to filter weak data from the model and allow us to see if dislikes, and other events, correlate with likes. Adding this type of data, which is usually thrown away, is one of the most powerful reasons to use the algorithm. BTW the algorithm is called Correlated Cross-Occurrence (CCO).

The benefit of using Lucene (at the heart of Elasticsearch) to do the KNN query is that it is fast, taking the user's realtime events into the query, but also because it is trivial to add all sorts of business rules: give me recs based on user events but only ones from a certain category, or give me recs but only ones tagged as "in-stock". In fact the business rules can have inclusion rules, exclusion rules, and be mixed with ANDs and ORs.
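A sketch of how such rules attach: they become ordinary Lucene filters layered onto the same query (the field names "category" and "tags" are assumptions, not the UR's actual schema):

    // Same KNN query as before, now restricted by business rules: an
    // inclusion rule on category AND an inclusion rule on a tag.
    val filteredQuery =
      """{ "query": { "bool": {
        |  "should": [ { "terms": { "purchase": ["item-7", "item-99"] } } ],
        |  "filter": [ { "term": { "category": "electronics" } },
        |              { "term": { "tags": "in-stock" } } ] } } }""".stripMargin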

BTW there is a version ready for testing with PIO 0.12.0 and ES5 here: https://github.com/actionml/universal-recommender/tree/0.7.0-SNAPSHOT Instructions are in the readme; notice it is in the 0.7.0-SNAPSHOT branch.


On Nov 17, 2017, at 7:59 AM, Andrew Troemner <atroemner@salesforce.com> wrote:

I'll echo Dan here. He and I went through the raw Mahout libraries called by the Universal Recommender, and while Noelia's description is accurate for an intermediate step, the indexing via Elasticsearch generates some separate relevancy scores based on its Lucene indexing scheme. The raw LLR scores are used in building this process, but the final scores served up by the APIs are post-processed, and cannot be used to reconstruct the raw LLRs (to my understanding).

There are also some additional steps, including down-sampling, which scrubs out very rare combinations (which otherwise would have very high LLRs for a single observation) and partially corrects for the statistical problem of multiple detection. But the underlying logic is per Ted Dunning's research, as summarized by Noelia, and is a solid way to approach interaction effects for tens of thousands of items, including secondary indicators (like demographics or implicit preferences).
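The down-sampling mentioned here is, roughly, a cap on interaction counts. A sketch under that assumption (the threshold and exact policy are made up, not Mahout's literal rule):

    // Illustration only: cap each user's event list before counting
    // cooccurrences, so hyperactive users cannot dominate the tables.
    def downsample(
        interactions: Seq[(String, String)], // (user, item) events
        maxPerUser: Int = 500): Seq[(String, String)] =
      interactions
        .groupBy(_._1)               // bucket events by user
        .valuesIterator
        .flatMap(_.take(maxPerUser)) // keep at most maxPerUser per user
        .toSeq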

ANDREW TROEMNER
Associate Principal Data Scientist | salesforce.com
Office: 317.832.4404
Mobile: 317.531.0216




On Fri, Nov 17, 2017 at 9:55 AM, Daniel Gabrieli <dgabrieli@salesforce.com> wrote:
Maybe someone can correct me if I am wrong, but in the code I believe Elasticsearch is used instead of "resulting LLR is what goes into the AB element in matrix PtP or PtL."

By default the strongest 50 LLR scores get set as searchable values in Elasticsearch per item-event pair.

You can configure the thresholds for significance using the configuration parameters maxCorrelatorsPerItem or minLLR. And this configuration is important, because at the default of 50 you may end up treating all "indicator values" as significant. More info here: http://actionml.com/docs/ur_config
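For orientation, those two knobs sit among the algorithm params in engine.json, roughly as in this fragment (a sketch: values are made up and surrounding fields are omitted; see the ur_config link above for the real layout):

    "algorithms": [{
      "name": "ur",
      "params": {
        "maxCorrelatorsPerItem": 50,
        "minLLR": 5.0
      }
    }]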



On Fri, Nov 17, 2017 at 4:50 AM Noelia Osés Fernández <noses@vicomtech.org> wrote:

Let's see if I've understood how LLR is used in the UR. Let P be the matrix for the primary conversion indicator (say purchases) and Pt its transpose.

Then, with a second matrix, which can be P again to make PtP or a matrix for a secondary indicator (say L for likes) to make PtL, we take a row from Pt (item A) and a column from the second matrix (either P or L, in this example) (item B), and we calculate the table that Ted Dunning explains on his webpage: the number of cooccurrences where items A AND B have been purchased (or purchased AND liked), the number of times that item A OR B has been purchased (or purchased OR liked), and the number of times that neither item A nor B has been purchased (or purchased or liked). With these counts we calculate LLR following the formulas that Ted Dunning provides, and the resulting LLR is what goes into the AB element in matrix PtP or PtL. Correct?
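In symbols, the test those counts feed is Dunning's G^2 statistic for the 2x2 table (notation added here for clarity; it is not in the original mail):

    G^2 = 2 \sum_{i,j \in \{1,2\}} k_{ij} \, \ln \frac{k_{ij} \, N}{R_i \, C_j},
    \qquad N = \sum_{i,j} k_{ij}

where k_11 counts users who did both A and B, k_12 only A, k_21 only B, k_22 neither; R_i and C_j are the row and column sums; and 0 ln 0 = 0 by convention. Equivalently, G^2 = 2N * I(rows; cols): twice the total count times the mutual information between the two events.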

Thank you!

On 16 November 2017 at 17:03, Noelia Osés Fernández <noses@vicomtech.org> wrote:
Wonderful! Thanks Daniel!

Suneel, I'm still new to the Apache ecosystem, so I know that Mahout is used but only vaguely... I still don't know the different parts well enough to have a good understanding of what each of them does (Spark, MLlib, PIO, Mahout, ...)

Thank you both!

On 16 November 2017 at 16:59, Suneel Marthi <smarthi@apache.org> wrote:
Indeed so. Ted Dunning is an Apache Mahout PMC member and committer, and the whole idea of search-based recommenders stems from his work and insights. If you didn't know, the PIO UR uses Apache Mahout under the hood, and hence you see the LLR.

On Thu, Nov 16, 2017 at 3:49 PM, Daniel Gabrieli <dgabrieli@salesforce.com> wrote:
I am pretty sure the LLR stuff in UR is based off of this blog post and associated paper:

http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html

Accurate Methods for the Statistics of Surprise and Coincidence, by Ted Dunning
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.5962
On Thu, Nov 16, 2017 at 10:26 AM Noelia Osés Fernández <noses@vicomtech.org> wrote:
Hi,

I've been trying to understand how the UR algorithm works and I think I have a general idea. But I would like to have a mathematical description of the step in which the LLR comes into play. In the CCO presentations I have found, it says:
(PtP) compares column to column using log-likelihood based correlation test


However, I have searched for "log-likelihood based correlation test" on Google but no joy. All I get are explanations of the likelihood-ratio test to compare two models.

I would very much appreciate a math explanation of the log-likelihood based correlation test. Any pointers to papers or any other literature that explains this specifically are much appreciated.

Best regards,
Noelia




--

Noelia Osés Fernández, PhD
Senior Researcher | Investigadora Senior


noses@vicomtech.org
+[34] 943 30 92 30
Data Intelligence for Energy and Industrial Processes | Inteligencia de Datos para Energía y Procesos Industriales