From: Donald Szeto
Date: Wed, 12 Jul 2017 16:53:21 +0000
Subject: Re: Eventserver API in an Engine?
To: user@predictionio.incubator.apache.org

Many good discussions. Let me provide my input on these issues.

Multiple installations of PredictionIO should use different database names.
An analogy would be WordPress installations, each of which expects its own
metadata database.
I understand the downside to this is that some users only have access to
one database. We can add database table prefixing support to alleviate this,
like most other projects do. I agree it is not very clear in the
documentation that installations of PIO should not be backed by overlapping
data stores.

Regarding the discussion of data and engine, here's how it seems to me:
there are two directions of data science development.

One perspective is that data collection and processing are independent of
data science development. Data are collected and organized ("apps" in PIO
terms). Developers go look at what's available, explore, and develop
(engines).

The other one is to provide turnkey solutions. Well-crafted engines expect
certain inputs and expose knobs for tuning.

PIO supports both styles today. Apps provide the grouping of data, and the
engine is the abstraction to define the concern of data. These have been
well defined since day 1.

Side track: a confusion I feel here is that templates have different
degrees of sophistication. The Universal Recommender is definitely much more
sophisticated and turnkey than the skeleton template, for example. We should
label this in our template gallery.

Going back to Mars's suggestion. If the use case is such that the engine
server also collects data used only by the engine, it feels like the right
abstraction would be embedding a subset of the event server that collects
data going to a single app. Recall that the app name is configured in
engine.json.

I think to resolve Mars's immediate need, we can implement an embedded event
server in a couple of phases. Roughly, it would be wiring the existing event
server in (with some refactoring) and marking it experimental, then
continuing toward a clean, app-specific event server.

Let me know how these sound.

On Tue, Jul 11, 2017 at 1:39 PM Kenneth Chan <kenneth@apache.org> wrote:

> re:
> "
> when deploying multiple engines with different versions of PIO and
> different storage configurations ....
>
> needing separate PIO installs regularly when testing the next release or
> development builds of PIO and when evaluating engine templates or
> algorithms that require new, different storage configs. Also, those in the
> consulting world are frequently required to keep client data separated for
> all kinds of privacy & legal reasons; with the storage corruption bug I
> reported, one client's data could become visible to or intermingled with
> another client's app.
> "
>
> When installing multiple PIOs separately, could you set each PIO database
> config to use different table names so they don't conflict?
> Or bring up another VM to isolate PIO?
>
> Donald, do you have best practices or advice if a user wants to install
> multiple PIO versions and be able to run them on the same machine?
>
>
>
> On Tue, Jul 11, 2017 at 12:49 PM, Kenneth Chan <kenneth@apache.org> wrote:
>
>> I think we are having the wrong impression that every template is
>> supposed to work together out of the box.
>>
>> The templates are meant to be examples and demonstrations - that's why
>> they are called templates! They are never meant to fit into any user
>> application right away. Each application is unique. The template only
>> assumes a specific use case for demonstration purposes.
>>
>> Users can start with a template for a simple case, but they need to
>> modify it for their final needs.
>>
>> For example, the PIO classification template is only meant for
>> demonstrating simple classification. In the end, how to use classification
>> is application specific. For example, one can modify the classification to
>> train a classifier on the same set of data used by recommendation.
>>
>>
>>
>>
>> On Tue, Jul 11, 2017 at 10:31 AM, Pat Ferrel <pat@occamsmachete.com>
>> wrote:
>>
>>> Understood, you have immediate practical reasons for 1 integrated
>>> deployment with the 2 endpoints. But Apache is a do-ology, meaning those
>>> who do something win the argument as long as they have enough consensus.
>>> I have enough experience with PIO that I have chosen to fix a lot of
>>> issues with the prototype design, having already gone down the
>>> "quick hack" path once. You may want to do something else if you have
>>> the resources.
>>>
>>> I fear that my deeper changes will not get enough consensus and we may
>>> end up with a competing ML/AI server framework some day. That is another
>>> ASF tendency. Innovations happen before going into ASF, often not under
>>> ASF rules.
>>>
>>> In any case, how much of your problem is workflow vs installation vs
>>> bundling of APIs? Can you explain it more?
>>>
>>>
>>> On Jul 11, 2017, at 9:37 AM, Mars Hall <mars@heroku.com> wrote:
>>>
>>> > On Jul 10, 2017, at 18:03, Kenneth Chan <kenneth@apache.org> wrote:
>>> >
>>> > it's all the same set of events collected for my application, and I
>>> > can create multiple engines to use this data for different purposes.
>>>
>>>
>>> Clear to me, ⬆️ this is the prevailing reasoning behind the
>>> "separateness" of the Eventserver. I do not forsake this design goal, but
>>> ask that we consider the usability & durability of PredictionIO when
>>> deploying multiple engines with different versions of PIO and different
>>> storage configurations. This will probably happen for anyone who uses
>>> PredictionIO long-term in production, as their new projects come on-line
>>> with newer & better versions & configurations.
>>>
>>> I encounter this situation of needing separate PIO installs regularly
>>> when testing the next release or development builds of PIO and when
>>> evaluating engine templates or algorithms that require new, different
>>> storage configs. Also, those in the consulting world are frequently
>>> required to keep client data separated for all kinds of privacy & legal
>>> reasons; with the storage corruption bug I reported, one client's data
>>> could become visible to or intermingled with another client's app.
>>>
>>> In starting this thread, I was hoping to find some traction with the
>>> idea of making it possible to completely self-contain a PredictionIO app
>>> by adding the Events API to the process started with `pio deploy`.
>>>
>>> Goal: Queries & Events APIs in the same process.
>>>
>>> When considering the architecture of apps, sharing a database between
>>> two or more apps is considered a very naughty way to get around having
>>> clear, clean, inter-process API's. My team at Salesforce/Heroku has been
>>> struck by this exact issue with PredictionIO. So, I am seeking a way to
>>> fix this without requiring a rewrite of PredictionIO. I am excited to
>>> hear about the new architecture prototypes, yet our reality is that this
>>> is an issue now.
>>>
>>> *Mars
>>>
>>> ( <> .. <> )
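
For illustration only, here is a minimal sketch of the "Queries & Events APIs
in the same process" goal combined with the single-app embedded event server
idea from this thread. It is written in Python for brevity, not PIO's Scala,
and it is not PredictionIO code: `EmbeddedEngineServer`, `handle`, the
in-memory event list, and `count_events_for_user` are hypothetical names
invented for this example. The only things taken from the discussion are that
events are bound to the one app named in engine.json and that both APIs live
in one process; the `/events.json` and `/queries.json` paths simply mirror the
ones the existing Event Server and deployed engines use.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]
Query = Dict[str, Any]


def count_events_for_user(query: Query, events: List[Event]) -> int:
    """Toy stand-in for an engine's predict step: count a user's events."""
    return sum(1 for e in events if e.get("entityId") == query.get("user"))


@dataclass
class EmbeddedEngineServer:
    """Hypothetical one-process server exposing both APIs for one app."""

    app_name: str  # assumption: would be read from engine.json
    predict: Callable[[Query, List[Event]], Any]
    events: List[Event] = field(default_factory=list)  # stand-in for storage

    def handle(self, method: str, path: str, body: Dict[str, Any]) -> Dict[str, Any]:
        """Route a request to the events or queries handler."""
        if method == "POST" and path == "/events.json":
            # Every event is bound to the single configured app, so no
            # cross-app access key lookup is needed in this simplified model.
            stored = dict(body, appName=self.app_name)
            self.events.append(stored)
            return {"status": 201, "eventId": str(len(self.events) - 1)}
        if method == "POST" and path == "/queries.json":
            return {"status": 200, "result": self.predict(body, self.events)}
        return {"status": 404}


if __name__ == "__main__":
    server = EmbeddedEngineServer(app_name="MyApp", predict=count_events_for_user)
    server.handle(
        "POST",
        "/events.json",
        {"event": "view", "entityType": "user", "entityId": "u1"},
    )
    print(server.handle("POST", "/queries.json", {"user": "u1"}))
```

The design choice the sketch tries to show is that the events handler never
accepts an app identifier from the caller: the app is fixed at construction
time, which is what would make such an embedded server "app-specific".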