Mailing-List: contact user-help@predictionio.incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@predictionio.incubator.apache.org
MIME-Version: 1.0
References: <63B094F2-1EDE-4649-AC2C-9EB39135CC59@heroku.com>
 <CAD8z1J+vb+yj0MmDtUTZR8kk-ikQ5sktvo-9qqie5Oz87iVq7w@mail.gmail.com>
 <9D0DD1A7-64A6-4E38-9A3E-4C4BF35E789B@occamsmachete.com> <12CAE521-1F3C-4779-8AC4-988D9D7DFB87@heroku.com>
 <3ECA1BDF-758D-4395-8699-32677FF546BB@occamsmachete.com> <CANT7bZfngE7BR5dQqq2P2iTGSiPAK3KbwEABxEzBW4w10bKKBg@mail.gmail.com>
 <BF9AEF06-2AEF-47B0-A210-43B48C55988A@occamsmachete.com> <CANT7bZdUeH-q=o8CPOvFcvWwN9iDn_VC5OsgEAVZoT8mW8TjRA@mail.gmail.com>
 <CANT7bZdCZWxdTbGrm9a6v+8o=AiRJfsRniy7rX6MS4zpQ0rHAg@mail.gmail.com>
 <AC484AC2-BF35-44F3-9F91-A43AB942B3B2@occamsmachete.com> <CANT7bZcmwSp32xPa3tbNnYKGdO4q0PCJTTCnhQM57bxkLsGwMA@mail.gmail.com>
 <2170929D-189D-4E69-BD56-1225A0067ADB@occamsmachete.com> <CANT7bZcmdFrOSQVpjWYO31ogNeOLjjVHsmXu0GgVgmG_9W9kAQ@mail.gmail.com>
 <CANT7bZd-fCjh0Qv_7=BTkMrBANTRP3uguFzBii+ZGCnPq7420Q@mail.gmail.com>
 <CA75E92A-5260-4E81-9341-3F4714B1CBBD@heroku.com> <2A2F0CC6-400E-4AC3-B42A-3FF98618A8AA@occamsmachete.com>
 <CANT7bZe_wp_Q4j4=DS7-fYXadcT0xa6=xzPBJ9iLqOLYLBaUpg@mail.gmail.com> <CANT7bZccC3vtpk48vXuNSPyKJG+pbHKYaaja4CKitxJ2yipP3A@mail.gmail.com>
In-Reply-To: <CANT7bZccC3vtpk48vXuNSPyKJG+pbHKYaaja4CKitxJ2yipP3A@mail.gmail.com>
From: Donald Szeto <donald@apache.org>
Date: Wed, 12 Jul 2017 16:53:21 +0000
Message-ID: <CAD8z1JL+XnTCVS9odpKCFORPVLjqkAb5Zrc3ZmpPk2mgcNzD+g@mail.gmail.com>
Subject: Re: Eventserver API in an Engine?
To: user@predictionio.incubator.apache.org
Content-Type: multipart/alternative; boundary="001a1140202200c42f055421a813"
archived-at: Wed, 12 Jul 2017 16:53:51 -0000

--001a1140202200c42f055421a813
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Many good discussions. Let me provide my input on these issues.

Multiple installations of PredictionIO should use different database names.
An analogy would be WordPress installations, each of which expects its own
metadata database. I understand the downside is that some users only have
access to one database. We can add database table prefixing support to
alleviate this, as most other projects do. I agree the documentation does
not make it clear that separate installations of PIO should not be backed
by overlapping data stores.
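
To illustrate the naming point, here is a minimal Python sketch. It is not an
existing PIO feature. It assumes the repository name settings found in
pio-env.sh (PIO_STORAGE_REPOSITORIES_*_NAME) and their usual defaults; the
prefix helper and the overlap check are hypothetical.

```python
# Hypothetical sketch: give each PIO installation its own repository names.
# The setting names mirror pio-env.sh; the helpers below are illustrative only.
REPOSITORIES = ("METADATA", "EVENTDATA", "MODELDATA")
DEFAULT_NAMES = {
    "METADATA": "pio_meta",
    "EVENTDATA": "pio_event",
    "MODELDATA": "pio_model",
}


def repository_env(prefix):
    """Return pio-env.sh style settings with every repository name prefixed."""
    return {
        f"PIO_STORAGE_REPOSITORIES_{repo}_NAME": f"{prefix}_{DEFAULT_NAMES[repo]}"
        for repo in REPOSITORIES
    }


def overlapping_names(env_a, env_b):
    """Return the repository names that two installations would share."""
    return set(env_a.values()) & set(env_b.values())


staging = repository_env("staging")
production = repository_env("prod")
```

An empty result from overlapping_names(staging, production) means the two
installations would not touch the same databases or tables.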

Regarding the discussion of data and engines, it seems to me there are
two directions of data science development.

One perspective is that data collection and processing are independent of
data science development. Data are collected and organized ("apps" in PIO
terms). Developers go look at what's available, explore, and develop
(engines).

The other one is to provide turnkey solutions. Well-crafted engines expect
certain inputs and expose knobs for tuning.

PIO supports both styles today. Apps provide the grouping of data, and the
engine is the abstraction that defines the concern over that data. These
have been well defined since day 1.

Side track: one source of confusion, I feel, is that templates have different
degrees of sophistication. The Universal Recommender is definitely much more
sophisticated and turnkey than the skeleton template, for example. We should
label this in our template gallery.

Going back to Mars's suggestion: if the use case is such that the engine
server also collects data used only by that engine, it feels like the right
abstraction would be to embed a subset of the event server that collects data
going to a single app. Recall that the app name is configured in engine.json.
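
As a rough sketch of that abstraction (Python, purely illustrative): a
collector that reads the app name from engine.json once and tags every event
with it. It assumes the app name lives under datasource.params.appName, as in
the common templates; the class name is hypothetical and the in-memory list
merely stands in for the real event store.

```python
import json

# Fields every event needs before it is accepted.
REQUIRED_FIELDS = {"event", "entityType", "entityId"}


class SingleAppEventCollector:
    """Hypothetical collector bound to the one app named in engine.json."""

    def __init__(self, engine_json_text):
        config = json.loads(engine_json_text)
        # Assumption: the template keeps the app name at this path.
        self.app_name = config["datasource"]["params"]["appName"]
        self.events = []  # stand-in for the real event store

    def collect(self, event):
        """Validate an event and record it under the configured app."""
        missing = REQUIRED_FIELDS - event.keys()
        if missing:
            raise ValueError(f"missing fields: {sorted(missing)}")
        stored = dict(event, appName=self.app_name)
        self.events.append(stored)
        return stored


engine_json = '{"datasource": {"params": {"appName": "MyApp"}}}'
collector = SingleAppEventCollector(engine_json)
stored = collector.collect({"event": "view", "entityType": "user", "entityId": "u1"})
```

Because the app is fixed at construction time, a caller cannot direct events
to any other app, which is the point of the subset.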

I think to resolve Mars's immediate need, we can implement the embedded event
server in a couple of phases. Roughly, the first phase would wire in the
existing event server (with some refactoring) and mark it experimental; we
would then continue toward a clean, app-specific event server.
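
The first phase could look roughly like the Python sketch below: one
dispatcher serving queries, with the Events API mounted only when it is
explicitly enabled. The paths /queries.json and /events.json are the ones the
engine server and event server use today; make_router, the handlers, and the
status codes are hypothetical simplifications.

```python
def make_router(query_handler, event_handler=None):
    """Build a path dispatcher; events are served only if a handler is given."""
    routes = {"/queries.json": query_handler}
    if event_handler is not None:
        # Experimental: opt-in embedding of the Events API in the same process.
        routes["/events.json"] = event_handler

    def route(path, payload):
        handler = routes.get(path)
        if handler is None:
            return 404, {"message": "Not Found"}
        return 200, handler(payload)

    return route


def answer_query(query):
    """Placeholder for the deployed engine's prediction logic."""
    return {"itemScores": []}


def store_event(event):
    """Placeholder for the embedded event server's storage logic."""
    return {"eventId": "placeholder-id"}


queries_only = make_router(answer_query)
combined = make_router(answer_query, store_event)
```

Keeping the events route behind an optional argument means existing
deployments behave exactly as before unless they opt in.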

Let me know how these sound.

On Tue, Jul 11, 2017 at 1:39 PM Kenneth Chan <kenneth@apache.org> wrote:

> re:
> "
> when deploying multiple engines with different versions of PIO and
> different storage configurations ....
>
> needing separate PIO installs regularly when testing the next release or
> development builds of PIO and when evaluating engine templates or
> algorithms that require new, different storage configs. Also, those in the
> consulting world are frequently required to keep client data separated for
> all kinds of privacy & legal reasons; with the storage corruption bug I
> reported, one client's data could become visible to or intermingled with
> another client's app.
> "
>
> When installing multiple PIOs separately, could you set each PIO database
> config to use a different table name so they don't conflict?
> Or bring up another VM to isolate PIO?
>
> Donald, do you have a best practice or advice for users who want to install
> multiple PIO versions and run them on the same machine?
>
>
>
> On Tue, Jul 11, 2017 at 12:49 PM, Kenneth Chan <kenneth@apache.org> wrote:
>
>> I think we have the wrong impression that all templates are supposed
>> to work together out of the box.
>>
>> The templates are meant to be examples and demonstrations; that's why
>> they are called templates! They are never meant to fit into any user
>> application right away. Each application is unique. A template only
>> assumes a specific use case for demonstration purposes.
>>
>> Users can start with a template for simple cases, but they need to modify
>> it for their final needs.
>>
>> For example, the PIO classification template is only meant for
>> demonstrating simple classification. In the end, how to use classification
>> is application specific. For example, one can modify the classification
>> template to train a classifier on the same set of data used by
>> recommendation.
>>
>>
>>
>>
>> On Tue, Jul 11, 2017 at 10:31 AM, Pat Ferrel <pat@occamsmachete.com>
>> wrote:
>>
>>> Understood, you have immediate practical reasons for 1 integrated
>>> deployment with the 2 endpoints. But Apache is a do-ology, meaning those
>>> who do something win the argument as long as they have enough consensus. I
>>> have enough experience with PIO that I have chosen to fix a lot of issues
>>> with the prototype design, having already gone down the “quick hack” path
>>> once. You may want to do something else if you have the resources.
>>>
>>> I fear that my deeper changes will not get enough consensus and we may
>>> end up with a competing ML/AI server framework some day. That is another
>>> ASF tendency. Innovations happen before going into ASF, often not under ASF
>>> rules.
>>>
>>> In any case, how much of your problem is workflow vs installation vs
>>> bundling of APIs? Can you explain it more?
>>>
>>>
>>> On Jul 11, 2017, at 9:37 AM, Mars Hall <mars@heroku.com> wrote:
>>>
>>> > On Jul 10, 2017, at 18:03, Kenneth Chan <kenneth@apache.org> wrote:
>>> >
>>> > it's all same set of events collected for my application and i can
>>> create multiple engine to use these data for different purpose.
>>>
>>>
>>> Clear to me, ⬆️ this is the prevailing reasoning behind the
>>> "separateness" of the Eventserver. I do not forsake this design goal, but
>>> ask that we consider the usability & durability of PredictionIO when
>>> deploying multiple engines with different versions of PIO and different
>>> storage configurations. This will probably happen for anyone who uses
>>> PredictionIO long-term in production, as their new projects come on-line
>>> with newer & better versions & configurations.
>>>
>>> I encounter this situation of needing separate PIO installs regularly
>>> when testing the next release or development builds of PIO and when
>>> evaluating engine templates or algorithms that require new, different
>>> storage configs. Also, those in the consulting world are frequently
>>> required to keep client data separated for all kinds of privacy & legal
>>> reasons; with the storage corruption bug I reported, one client's data
>>> could become visible to or intermingled with another client's app.
>>>
>>> In starting this thread, I was hoping to find some traction with the
>>> idea of making it possible to completely self-contain a PredictionIO app by
>>> adding the Events API to the process started with `pio deploy`.
>>>
>>> Goal: Queries & Events APIs in the same process.
>>>
>>> When considering the architecture of apps, sharing a database between
>>> two or more apps is considered a very naughty way to get around having
>>> clear, clean, inter-process APIs. My team at Salesforce/Heroku has been
>>> struck by this exact issue with PredictionIO. So, I am seeking a way to fix
>>> this without requiring a rewrite of PredictionIO. I am excited to hear
>>> about the new architecture prototypes, yet our reality is that this is an
>>> issue now.
>>>
>>> *Mars
>>>
>>> ( <> .. <> )
>>>
>>>
>>>
>>
>

--001a1140202200c42f055421a813--