From: Kenneth Chan <kenneth@apache.org>
To: user@predictionio.incubator.apache.org
Reply-To: user@predictionio.incubator.apache.org
Date: Mon, 10 Jul 2017 17:59:02 -0700
Subject: Re: Eventserver API in an Engine?

"I understand what you are aiming for—namely data independence from model and engine—but it is impossible and seems a very odd place to abstract when you put it in real terms. A Recommender will never need the same data as a neural net, a clusterer, or a classifier. This abstraction does not exist in the data because it is not there in the algorithm and should not be forced away from the Engine."

The data belongs to the application, not the engine. The engine should use the data to train a model.

I can train a clustering model based on user behavior events to cluster my users; the same events can be used by a recommendation engine. I can create another engine to classify my users' intent based on events generated by my application. I can create a neural net based on the product descriptions for NLP purposes, etc.

On Mon, Jul 10, 2017 at 8:10 AM, Pat Ferrel <pat@occamsmachete.com> wrote:

> Good to know, but if there is an event blocker and sniffer then they should be a concern of the Engine. Otherwise you are hiding Engine specifics from the Engine. The most irrefutable need for the "input" method is Kappa requirements and the Lambda need for realtime changes to the model.
>
> I understand what you are aiming for—namely data independence from model and engine—but it is impossible and seems a very odd place to abstract when you put it in real terms. A Recommender will never need the same data as a neural net, a clusterer, or a classifier. This abstraction does not exist in the data because it is not there in the algorithm and should not be forced away from the Engine.
>
> BTW the way the prototype server handles this data independence is by allowing the user to ignore the engine (which may be under tuning or development and not reliable for validation) and simply mirroring un-validated events (PIO has this built into some client SDKs, but this suffers from getting only a single client's events). Then they can be replayed or modified as with exported PIO events. The server also imports these, maintaining event-level compatibility with PIO. This even works with Kappa. If you want to re-create a Kappa model you simply replay the mirrored events. But mirroring is optional and likely to be turned off once the Engine is running correctly. IMO it is a more flexible model than forcing data independence away from the Engine and maintaining it in the storage layer.
>
> So far I've written 3 PIO Templates from scratch: the UR, the Contextual Bandit (MAB-type online learner), and the db-cleaner. What I have found with these rather different algorithms is:
> 1) PIO works ok with the UR but could use realtime validation and a better way of dropping old events.
> 2) Kappa doesn't work well at all with PIO but does with the prototype server.
> 3) event/dataset compatibility can be maintained between PIO and prototype Engines.
> 4) there is no need for a db-cleaner in the prototype. The Engine persists mutable objects and makes realtime changes to their state, and event streams can be handled as the Engine needs (Kappa discards without storing, Lambda may store), but since they are separate from $set and $unset these streams can have db TTLs to age out old data for Lambda Engines. The system is always self-cleaning with no heavyweight operation required to keep just the right data (the db cleaner is heavyweight and slow); the data does not grow forever by design. This was never addressed as a design requirement for PIO and the add-on we did is not a very good solution.
>
> On Jul 9, 2017, at 7:09 PM, Kenneth Chan <kenneth@apache.org> wrote:
>
> I think there is a philosophical discussion:
> 1) as a PIO user, should I collect my event data based on my application's uniqueness and ML needs (of course, I can use the template format as a reference), then create an engine or modify an engine template to use these data to train a model,
> or
> 2) as a PIO user, because I'm using this specific engine template, must I import and transform my data into the exact format required by the template, and send it to the event server in order to make it work?
>
> However, regardless of the above, the PIO event server currently supports an "event blocker" and an "event sniffer" to solve the issues you mentioned:
> 1) the "event blocker" can be used for "event validation in real time" - the engine template can provide a sample event blocker implementation which can be used to reject improper events.
>
> 2) the "event sniffer" can be used for "forwarding specific events to other processing systems in real time" - the engine template can also provide a sample sniffer (e.g. send to the UR's Elasticsearch to update meta data).
>
> Advanced users can modify these based on their application needs (say, if they have multiple engines). Starters may use them out of the box along with the template.
>
> See http://mail-archives.apache.org/mod_mbox/incubator-predictionio-dev/201706.mbox/%3CCAF_HxLtEonOVALSQgrCRGXctAbL7eypxwG0ErHpaBJJym15j5Q%40mail.gmail.com%3E
>
> On Sun, Jul 9, 2017 at 5:28 PM, Pat Ferrel <pat@occamsmachete.com> wrote:
>
>> I must disagree here. The Engine should decide the disposition of data, which cannot be left to a generic EventServer. Data is the concern of the Engine, not the EventServer or the PIO framework, for these reasons:
>>
>> 1) input needs to be validated, and since it is defined by the Engine it seems rather obvious that the Engine must provide an "input" method like the "predict" method. This input method parses and validates input, responding with format errors that only it knows about. It also decides…
>> 2) a Kappa learner must get data in realtime and does not save datasets, only buffers of data at most.
>> 3) Kappa and even some Lambda algorithms need to modify/update the model in realtime. Realtime model updates define "Kappa online learners", but there are also Lambda learners like the UR that need to update parts of the model when, for instance, item attributes change (out of stock, ...). As PIO stands now this can only be done at train time, which is a rather troublesome limitation.
>> 4) It is the Engine's concern whether input modifies mutable or immutable data. One engine may use a named event to do something, but the name of the event is only known by the Engine. So if you agree that data comes in 2 forms, only the Engine can define and enforce this.
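
For concreteness, a minimal sketch of what point 1 above could look like if validation lived next to prediction inside the Engine. The trait, case classes, and error handling here are illustrative assumptions, not PIO's or the prototype's actual API:

```scala
// Hypothetical shapes, for illustration only (not PIO classes).
case class Event(event: String, entityType: String, entityId: String,
                 properties: Map[String, Any] = Map.empty)
case class Query(user: String, num: Int)
case class PredictedResult(itemScores: Seq[(String, Double)])

trait Engine {
  // Only the Engine knows which event names and fields it actually consumes,
  // so it owns validation ("input") just as it owns "predict".
  def input(e: Event): Either[String, Unit]
  def predict(q: Query): PredictedResult
}

class RecommenderEngine extends Engine {
  private val knownEvents = Set("$set", "$unset", "buy", "view")

  def input(e: Event): Either[String, Unit] =
    if (!knownEvents(e.event))
      Left(s"unknown event '${e.event}': a generic EventServer would accept and silently ignore it")
    else if (e.entityId.isEmpty)
      Left("missing entityId")
    else
      Right(()) // accepted: store, mirror, or update the model as the Engine decides

  def predict(q: Query): PredictedResult = PredictedResult(Seq.empty)
}
```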
>>
>> This is certainly not to denigrate the EventStore, which is most certainly required by every existing PIO Lambda Engine. But it should be the concern of the Engine how it is used, and the only way to do this is to make "input" the concern of the Engine. This can be done generically if there is truly no validation beyond the current one, and so does not needlessly complicate Engines.
>>
>> I am also not arguing for a different encoding of data. The PIO event JSON is quite flexible and I have not seen a need to alter it. However, because of its flexibility the EventServer cannot really validate it. The PIO events are even quite sufficient for Lambda and Kappa data encoding; in fact we have a Lambda Template in PIO that we made into a Kappa Template with the prototype server and used exactly the same event encoding. Since the prototype requires that the Engine validate it and respond to the input request, we immediately found event encoding errors that were very serious and had been in the client for a long time, but since the events looked perfectly fine to the PIO EventServer, the errors were never detected and the data was in fact ignored. Within a day of replaying exported PIO events to the prototype server the issue was resolved and fixed in the client.
>>
>> On Jul 8, 2017, at 12:48 AM, Kenneth Chan <kenneth@apache.org> wrote:
>>
>> re: "bundling event server as engine"
>>
>> It depends on how we want to separate the concerns.
>>
>> The way I look at it is to decouple 1. the data collection service (PIO event server) and 2. the modeling and prediction service (PIO engine) - that's the separation of concerns.
>>
>> Ideally data is agnostic to the engine, and should be tied to the user application.
>> The original vision is that users collect data, then can create multiple PIO engines which use the collected data.
>> If we combine 1 and 2, how could a user create engine A and engine B to train models on the collected data for different ML use cases?
>>
>> For your input data problem, maybe another way is that the template should also provide an "event validator" which can be loaded into the event server and which advanced users can also customize.
>>
>> On Sat, Jul 8, 2017 at 12:31 AM, Kenneth Chan <kenneth@apache.org> wrote:
>>
>>> # re: "I see it as objects you see it as data stores"
>>>
>>> Not really. I see things based on what functionality and purpose they provide. Like you mentioned, the way Elasticsearch is used in the UR is part of the model: it is where the algorithm writes the computation result, which is then used for serving. In a way, it's the model, just a more complex model than a simple linear regression function.
>>> If we define "Model" as the output of the train() function, then the UR is storing the model into Elasticsearch - and it is required because the UR relies on Elasticsearch computation - meaning it's part of the UR's "model".predict()
>>>
>>> # re: "In reality the input comes in 2 types, persistent mutable objects and immutable streams of events (that may well be usable as a time window of data, dropping old events)"
>>>
>>> Like you said, basically there are two types of data:
>>> 1. mutable objects (e.g. meta data of a product, user profile, etc.)
>>> 2. immutable events (e.g. behavior data)
>>>
>>> However, 1 can be considered as 2 if we treat the "changes" of a mutable object as "events" as well - basically this is the current event server design.
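
To put the two forms side by side, a rough sketch of an Engine-owned input path that splits them: $set/$unset mutate a persistent object, everything else is appended as an immutable behavior stream that can age out. The store interfaces are hypothetical placeholders, not PIO classes:

```scala
import scala.concurrent.duration._

// Hypothetical event shape and stores, for illustration only.
case class Event(event: String, entityType: String, entityId: String,
                 properties: Map[String, Any] = Map.empty)

trait ObjectStore {
  def upsert(kind: String, id: String, props: Map[String, Any]): Unit
  def remove(kind: String, id: String, fields: Set[String]): Unit
}
trait EventStream {
  def append(e: Event, ttl: FiniteDuration): Unit
}

class InputRouter(objects: ObjectStore, stream: EventStream) {
  // $set / $unset are "changes of a mutable object" (user or item properties);
  // everything else is an immutable behavior event that a Lambda Engine can
  // let a db TTL age out, and a Kappa Engine can consume without storing.
  def route(e: Event): Unit = e.event match {
    case "$set"   => objects.upsert(e.entityType, e.entityId, e.properties)
    case "$unset" => objects.remove(e.entityType, e.entityId, e.properties.keySet)
    case _        => stream.append(e, ttl = 90.days)
  }
}
```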
>>>
>>> But I agree some use cases may not care about changes of mutable objects - for this, we can provide some API/option for people to store mutable objects and always overwrite, or use a better storage structure to capture the changes of mutable objects.
>>>
>>> On Fri, Jun 30, 2017 at 5:29 AM, Pat Ferrel <pat@occamsmachete.com> wrote:
>>>
>>>> Actually I think it's a great solution. The question about different storage config (https://issues.apache.org/jira/browse/PIO-96) is because Elasticsearch performs the last step of the algorithm; it is not just a store for models, so it's an integral part of the compute engine, not the storage. If it looks that way I hardly think it matters in the way implied (see below where Templates should come with composable containers). This is actually the primary difference in the way you and I look at the problem: I see it as objects, you see it as data stores. Let's add the question of compute backends, and unfortunately users will have to pick the solution along with the engines they require (TensorFlow, anyone?). If PIO is going to be a viable ML/AI server in the long term it has to be a lot more flexible, not less so. In the proto server I mention, the Engine decides on the compute backend and the example Template does not use Spark.
>>>>
>>>> The prototype server I mentioned actually only handles metadata, installs engines, and mirrors input. To handle Kappa as well as Lambda algorithms the Engine must decide what, and if, it needs to store. Therefore instead of assuming an EventServer we have mirroring of un-validated events. This has many benefits. For one thing we can require validation from the Engine with every event. This is because the single most frequent mistake by users I've dealt with is malformed input. PIO's input scheme is great because it is so flexible, but because of that, validation is nil. I have seen users that have been using a Template for a year without understanding that most of their data was ignored by the Template code (not the UR in this case). I have spent literally thousands of hours helping correct bad input over email even though the UR has orders of magnitude better docs than any other Template. Yes, it's also a lot more complicated, but anyway, I'm tired of this—we need validation of every input. Then maybe I will only spend 90% of those hours :-P
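
A tiny sketch of the mirroring idea: append the raw, still-unvalidated request body to a per-engine log before anything else happens, so bad input can be inspected and the whole stream replayed later. The file layout and class names are made up for illustration:

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths, StandardOpenOption}
import java.time.LocalDate

class EventMirror(dir: String) {
  // One newline-delimited JSON file per engine per day, written before validation.
  def mirror(engineId: String, rawJson: String): Unit = {
    val file = Paths.get(dir, s"$engineId-${LocalDate.now}.jsonl")
    Files.createDirectories(file.getParent)
    Files.write(file, (rawJson + "\n").getBytes(StandardCharsets.UTF_8),
      StandardOpenOption.CREATE, StandardOpenOption.APPEND)
  }

  // Replay a day of mirrored input, e.g. after fixing a client bug or
  // changing algo params, by feeding each raw event back to the Engine.
  def replay(engineId: String, day: LocalDate)(send: String => Unit): Unit = {
    val file = Paths.get(dir, s"$engineId-$day.jsonl")
    if (Files.exists(file))
      Files.readAllLines(file, StandardCharsets.UTF_8).forEach(line => send(line))
  }
}
```

Replaying into a Kappa Engine rebuilds the model from the stream; a Lambda Engine can treat the same log as a bulk import.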
>>>>
>>>> Anyway, I think the separation of concerns should be: the Server handles metadata, installs engines, and mirrors input. The Template framework provides required APIs for Engines that must be implemented, and a set of Tools they can use or ignore to use whatever they need. If the Engine provides an input method it can validate and, if it is Kappa, learn immediately (update models in real time); if it is Lambda, store the valid data using something like an Event Store. The train method is then optional and, of course, query.
>>>>
>>>> BTW the reason I call it a PredictionServer (in PIO) is because it is not an Engine Server; all it does is provide a query endpoint. This corresponds to only one method of an Engine and there is no reason to look at a query endpoint any differently than the other public APIs of the Engine.
>>>>
>>>> I guess I look at this in an object oriented way, not a data oriented way. This leads to Template code/Engines making more decisions. The Kappa template we have for this proto server never uses Spark. Why would it to implement Kappa online learning? It also does not need an Event Store because it only stores models. This is also fine for Lambda, where an Event Store is required, because the Engine provides the input method too, where it can make the store/no-store decision.
>>>>
>>>> This has other benefits. Treating input as an immutable stream has some major flaws. Some of the data has to be dropped, we cannot store forever—no one can afford that much disk. And some data can never be dropped because only the aggregate of all object changes makes any sense. In reality the input comes in 2 types, persistent mutable objects and immutable streams of events (that may well be usable as a time window of data, dropping old events). With the above split, the mirror always has all input in case it's needed; the Engine can decide what events operate on mutable objects and store the rest as a stream in the Event Store (with TTL for time windows). Once this is trusted to work correctly, mirroring can be stopped. In fact the mutable objects can affect the model in real time now, even with Lambda Templates like the UR. When an object property changes in today's PIO we have to wait till train before the model changes, because the Engine does not have an input method. If it did, then input that should affect the model can.
>>>>
>>>> This solves all my pet peeves, internal API-wise, and allows one implementation of a SaaS-capable, multi-tenant, secure Server. And here multi-tenancy is super lightweight. Since most users have only one Template, they may have to install supporting compute engines or stores. This is a one-time issue for them, and Templates should come with containers and scripts to compose them. We're already doing this with PIO. A fully clustered install takes an hour. Admin of such a monster is another issue that is not necessarily better or even good in this model, but a subject for another day.
>>>>
>>>> On Jun 30, 2017, at 1:40 AM, Kenneth Chan <kenneth@apache.org> wrote:
>>>>
>>>> I agree that there is confusion regarding event server vs event storage and the unclear usage definition of the types of data storage (e.g. meta-data vs model),
>>>> but I'm not sure if bundling the Event Server with the Engine Server (or, as Pat calls it, PredictionServer) is a good solution.
>>>>
>>>> Currently PIO has 3 "types" of storage:
>>>> - METADATA: store PIO's administrative data ("Apps", etc)
>>>> - EVENTDATA: store the pure events
>>>> - MODELDATA: store the model
>>>>
>>>> 1. One confusion is that when the Universal Recommender is used, Elasticsearch is required in order to serve the Predicted Results. Is this type of storage considered "MODELDATA" or "METADATA", or should we introduce a new type of storage for "Serving" purposes (which can be engine specific)?
>>>>
>>>> 2. A question regarding the problem described in ticket https://issues.apache.org/jira/browse/PIO-96
>>>>
>>>> ```
>>>> Problems emerge when a developer tries running multiple engines with different storage configs on the same underlying database, such as:
>>>>
>>>> - a Classifier with *Postgres* meta, event, & model storage, and
>>>> - the Universal Recommender with *Elasticsearch* meta plus *Postgres* event & model storage.
>>>> ```
>>>>
>>>> Why would a user want to use different storage configs for different engines? Can the classifier match the same configuration as the Universal Recommender? Because I thought the storage configuration is more tied to PIO as a whole rather than per engine.
>>>>
>>>> Kenneth
>>>>
>>>> On Thu, Jun 29, 2017 at 10:22 AM, Pat Ferrel <pat@occamsmachete.com> wrote:
>>>>
>>>>> Are you asking about the EventServer or the PredictionServer? The EventServer is multi-tenant with access keys, not really pure REST. We (ActionML) did a hack for a client to the PredictionServer to allow Actors to respond on the same port for several engine queries. We used REST addressing for this, which adds yet another id. This makes for one process for the EventServer and one for the PredictionServer. Each responding engine was behind an Actor, not a new process. So it's possible, but IMO it makes the API as a whole rather messy. We also had to change the workflow so metadata was read on `pio deploy`, so one build could then deploy many times with different engine.jsons and different PredictionServer endpoints for queries only. This comes pretty close to clean multi-tenancy but is not SaaS capable without solving SSL and Auth for both services.
>>>>>
>>>>> The hack was pretty ugly in the code and after doing that I concluded that a big chunk needed a rewrite, and hence the prototype. It depends on what you want, but if you want SaaS I think that means SSL + Auth + multi-tenancy, and you also mention minimizing process boundaries. There are rather many implications to this.
>>>>>
>>>>> On Jun 29, 2017, at 9:57 AM, Mars Hall <mars@heroku.com> wrote:
>>>>>
>>>>> Donald, Pat, great to hear that this is a well-pondered design challenge of PIO 😄 The prototype, composable, all-in-one server sounds promising.
>>>>>
>>>>> I'm wondering if there's a more immediate possibility to address adding the `/events` REST API to Engine? Would it make sense to try invoking an `EventServiceActor` in the tools.commands.Engine#deploy method? If that would be a distasteful hack, just say so. I'm trying to understand the possibility of solving this in the current codebase vs a visionary new version of PIO.
>>>>>
>>>>> *Mars
>>>>>
>>>>> ( <> .. <> )
>>>>>
>>>>> > On Jun 28, 2017, at 18:01, Pat Ferrel <pat@occamsmachete.com> wrote:
>>>>> >
>>>>> > Ah, one of my favorite subjects.
>>>>> >
>>>>> > I'm working on a prototype server that handles online learning as well as Lambda style. There is only one server with everything going through REST. There are 2 resource types, Engines and Commands. Engines have REST APIs with endpoints for Events and Queries. So something like POST /engines/resource-id/events would send an event to what is like a PIO app, and POST /engines/resource-id/queries does the PIO query equivalent. Note that this is fully multi-tenant and has only one important id. It's based on akka-http in a fully microservice-type architecture. While the Server is running you can add completely new Templates for any algorithm, thereby adding new endpoints for Events and Queries. Each "tenant" is super lightweight since it's just an Actor, not a new JVM. The CLI is actually Python that hits the REST API with a Python SDK, and there is a Java SDK too. We support SSL and OAuth2, so having those baked into an SDK is really important. Though a prototype, it can support multi-tenant SaaS.
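
Roughly what those two endpoints could look like in akka-http. This is only a sketch against a hypothetical in-memory engine registry, not the actual prototype code:

```scala
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.server.Directives._

object EngineServerSketch {
  // Hypothetical per-engine handlers; in the prototype each engine sits behind an Actor.
  trait EngineHandler {
    def input(eventJson: String): String
    def query(queryJson: String): String
  }
  val engines = scala.collection.concurrent.TrieMap.empty[String, EngineHandler]

  val route =
    pathPrefix("engines" / Segment) { engineId =>
      concat(
        // POST /engines/<resource-id>/events sends an event to one engine instance
        path("events") {
          post {
            entity(as[String]) { body => complete(engines(engineId).input(body)) }
          }
        },
        // POST /engines/<resource-id>/queries is the query equivalent
        path("queries") {
          post {
            entity(as[String]) { body => complete(engines(engineId).query(body)) }
          }
        }
      )
    }

  def main(args: Array[String]): Unit = {
    implicit val system: ActorSystem = ActorSystem("engine-server")
    Http().newServerAt("localhost", 8080).bind(route)
  }
}
```

Adding a Template while the server runs would then just register another handler in `engines`, which is what keeps each tenant as light as an Actor rather than a new JVM.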
>>>>> >
>>>>> > We have a prototype online learner Template which does not save events at all, though it ingests events exactly like PIO in the same format; in fact we have the same template for both servers taking identical input. Instead of an EventServer it mirrors received events before validation (yes, we have full event validation that is template specific). This allows some events to affect mutable data in a database and some to just be an immutable stream, or even be thrown away for Kappa learners. For an online learner, each event updates the model, which is stored periodically as a watermark. If you want to change algo params you destroy the engine instance and replay the mirrored events. For a Lambda learner the Events may be stored like PIO.
>>>>> >
>>>>> > This is very much along the lines of the proposal I put up for future PIO, but the philosophy internally is so different that I'm now not sure how it would fit. I'd love to talk about it sometime, and once we do a Lambda Template we'll at least have some nice comparisons to make. We migrated the Kappa style Template to it so we have a good idea that it's not that hard. I'd love to donate it to PIO but only if it makes sense.
>>>>> >
>>>>> > On Jun 28, 2017, at 4:27 PM, Donald Szeto <donald@apache.org> wrote:
>>>>> >
>>>>> > Hey Mars,
>>>>> >
>>>>> > Thanks for the suggestion and I agree with your point on the metadata part. Essentially I think the app and channel concept should instead be logically grouped together with event, not metadata.
>>>>> >
>>>>> > I think in some advanced use cases, event storage should not even be a hard requirement, as engine templates can source data differently. In the long run, it might be cleaner to have the event server (and all relevant concepts such as its API, access keys, apps, etc) as a separable package that is by default turned on and embedded into the engine server. Advanced users can either make it standalone or even turn it off completely.
>>>>> >
>>>>> > I imagine this kind of refactoring would echo Pat's proposal on making a clean and separate engine and metadata management system down the road.
>>>>> >
>>>>> > Regards,
>>>>> > Donald
>>>>> >
>>>>> > On Wed, Jun 28, 2017 at 3:29 PM Mars Hall <mars@heroku.com> wrote:
>>>>> >
>>>>> > One of the ongoing challenges we face with PredictionIO is the separation of Engine & Eventserver APIs. This separation leads to several problems:
>>>>> >
>>>>> > 1. Deploying a complete PredictionIO app requires multiple processes, each with its own network listener
>>>>> > 2. Eventserver & Engine must be configured to share exactly the same storage backends (same `pio-env.sh`)
>>>>> > 3. Confusion between "Eventserver" (an optional REST API) & "event storage" (a required database)
>>>>> >
>>>>> > These challenges are exacerbated by the fact that PredictionIO's docs & `pio app` CLI make it appear that sharing an Eventserver between Engines is a good idea. I recently filed a JIRA issue about this topic. TL;DR: sharing an eventserver between engines with different Meta Storage config will cause data corruption:
>>>>> > https://issues.apache.org/jira/browse/PIO-96
>>>>> >
>>>>> > I believe a lot of these issues could be alleviated with one change to PredictionIO core:
>>>>> >
>>>>> > By default, expose the Eventserver API from the `pio deploy` Engine process, so that it is not necessary to deploy a second Eventserver-only process. A separate `pio eventserver` could still be optional if you need the separation of concerns for scalability.
>>>>> >
>>>>> > I'd love to hear what you folks think. I will file a JIRA enhancement issue if this seems like an acceptable approach.
>>>>> >
>>>>> > *Mars Hall
>>>>> > Customer Facing Architect
>>>>> > Salesforce Platform / Heroku
>>>>> > San Francisco, California
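
Mars's proposed change, sketched very loosely: the `pio deploy` process exposes both the existing query endpoint and the Eventserver API on one listener. The paths follow PIO's current REST surface (`/queries.json`, `/events.json?accessKey=...`), but the handlers and wiring below are illustrative stand-ins, not the actual `EventServiceActor` integration:

```scala
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.server.Directives._

object CombinedDeploySketch {
  // Stand-ins for the deployed engine's query handling and the event storage layer.
  def handleQuery(queryJson: String): String = """{"itemScores":[]}"""
  def storeEvent(accessKey: String, eventJson: String): String = """{"eventId":"stub"}"""

  val route = concat(
    // what `pio deploy` already serves today
    path("queries.json") {
      post { entity(as[String]) { q => complete(handleQuery(q)) } }
    },
    // the Eventserver API, mounted in the same process instead of a second one
    path("events.json") {
      parameter("accessKey") { key =>
        post { entity(as[String]) { e => complete(storeEvent(key, e)) } }
      }
    }
  )

  def main(args: Array[String]): Unit = {
    implicit val system: ActorSystem = ActorSystem("pio-combined")
    Http().newServerAt("0.0.0.0", 8000).bind(route) // one process, one listener
  }
}
```

A standalone `pio eventserver` would remain possible by binding only the event route in its own process when scaling calls for the separation.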