predictionio-user mailing list archives

From: Kenneth Chan <kenn...@apache.org>
Subject: Re: Eventserver API in an Engine?
Date: Sat, 08 Jul 2017 07:31:36 GMT
# re: " I see it as objects you see it as data stores"

not really. I see things based on what functionality and purpose it
provides. like you mentioned - The way Elasticseach is used in UR is part
of the model and where the algorithm write the computation result into and
then used as serving. In a way, it's the model. just a more complex model
than a simple linear regression function.
If we define "Model" as output of the train() function, then UR is storing
the model into Elasticsearch - and it is required because UR relies on
Elasticsearch computation - meaning it's part of UR's "model".predict()

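To illustrate that view, here is a minimal Scala sketch (the names and the
EsIndex interface are made up, not the UR's actual code): train() writes the
computed indicators into Elasticsearch, and that index is the model that
predict() queries.

```scala
// Hypothetical sketch: the Elasticsearch index *is* the model.
trait EsIndex {
  def bulkIndex(docs: Seq[Map[String, Any]]): Unit // write computed results
  def search(query: Map[String, Any]): Seq[String] // return ranked item ids
}

class UrLikeAlgorithm(es: EsIndex) {
  // train(): compute indicators and write them into the index; the index
  // itself is the output of train(), i.e. the model.
  def train(indicators: Seq[Map[String, Any]]): Unit = es.bulkIndex(indicators)

  // predict(): build a query from the user's recent history and let
  // Elasticsearch perform the final scoring step.
  def predict(userHistory: Seq[String]): Seq[String] =
    es.search(Map("terms" -> userHistory))
}
```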

# re:  "In reality the input comes in 2 types, persistent mutable objects
and immutable streams of events (that may well be usable as a time window
of data, dropping old events)"

like you said, basically there are two types of data type
1. mutable object (e.g meta data of a product, user profile, etc)
2. immutable event (e.g. behavior data)

However, 1 can be considered as 2 if we treat the "changes" of mutable
object as "event" as well - basically this's the current event server
design.

But i agree some use case may not care about changes of mutable object -
for this, we can provide some API/option for people to store mutable
objects and always overwrite. or use better storage structure to capture
the changes of mutable object.

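As a minimal sketch of the "changes as events" idea (illustrative names only,
not the Event Server's actual API): each update to a mutable object is stored
as an immutable change event, and the current state of the object is the fold
of its change events.

```scala
// Hypothetical sketch: a mutable object reconstructed from immutable changes.
case class ChangeEvent(entityId: String, properties: Map[String, String], t: Long)

// The latest value of each property wins, in time order.
def currentState(changes: Seq[ChangeEvent]): Map[String, String] =
  changes.sortBy(_.t).foldLeft(Map.empty[String, String]) { (state, e) =>
    state ++ e.properties
  }

// An "always overwrite" option would keep only the latest snapshot instead
// of the full change history.
```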

On Fri, Jun 30, 2017 at 5:29 AM, Pat Ferrel <pat@occamsmachete.com> wrote:

> Actually I think it's a great solution. The question about different
> storage configs (https://issues.apache.org/jira/browse/PIO-96) comes up
> because Elasticsearch performs the last step of the algorithm; it is not
> just a store for models, so it's an integral part of the compute engine,
> not the storage. If it looks that way, I hardly think it matters in the way
> implied (see below where Templates should come with composable containers).
> This is actually the primary difference in the way you and I look at the
> problem: I see it as objects, you see it as data stores. Let's add the
> question of compute backends, and unfortunately users will have to pick the
> solution along with the engines they require (TensorFlow, anyone?). If PIO
> is going to be a viable ML/AI server in the long term it has to be a lot
> more flexible, not less so. In the proto server I mention, the Engine
> decides on the compute backend and the example Template does not use Spark.
>
> The prototype server I mentioned actually only handles metadata, installs
> engines, and mirrors input. To handle Kappa as well as Lambda algorithms
> the Engine must decide what, and if, it needs to store. Therefore instead
> of assuming an EventServer we have mirroring of unvalidated events. This
> has many benefits. For one thing we can require validation from the Engine
> with every event. This matters because the single most frequent mistake by
> users I've dealt with is malformed input. PIO's input scheme is great
> because it is so flexible, but because of that, validation is nil. I have
> seen users that have been using a Template for a year without understanding
> that most of their data was ignored by the Template code (not the UR in
> this case). I have spent literally thousands of hours helping correct bad
> input over email, even though the UR has orders of magnitude better docs
> than any other Template. Yes, it's also a lot more complicated, but anyway,
> I'm tired of this; we need validation of every input. Then maybe I will
> only spend 90% of those hours :-P
>
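> For illustration, a minimal sketch of per-event validation owned by the
> Engine (the names and rules here are placeholders, not any Template's real
> code): the input endpoint rejects malformed events instead of silently
> ignoring them at train time.
>
> ```scala
> // Hypothetical sketch: an Engine-specific validator for incoming events.
> case class RawEvent(event: String, entityType: String, entityId: String,
>                     targetEntityId: Option[String])
>
> trait EventValidator {
>   def validate(e: RawEvent): Either[String, RawEvent]
> }
>
> object UrLikeValidator extends EventValidator {
>   private val knownEvents = Set("purchase", "view", "$set")
>   def validate(e: RawEvent): Either[String, RawEvent] =
>     if (!knownEvents.contains(e.event)) Left(s"unknown event name: ${e.event}")
>     else if (e.event != "$set" && e.targetEntityId.isEmpty)
>       Left(s"${e.event} event is missing targetEntityId")
>     else Right(e)
> }
> ```
>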
> Anyway, I think the separation of concerns should be: the Server handles
> metadata, installs engines, and mirrors input. The Template framework
> provides required APIs that Engines must implement, plus a set of Tools
> they can use or ignore as they need. If an Engine provides an input method
> it can validate, and if it is Kappa, learn immediately (update models in
> real time); if it is Lambda, it stores the valid data using something like
> an Event Store. The train method is then optional, and of course there is
> query.
>
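> As a sketch of that split (a hypothetical interface, not PIO's or the
> prototype's actual one): input and query are required, train is optional,
> and each Engine decides whether input events are learned from immediately
> or stored for a later train.
>
> ```scala
> // Hypothetical sketch of the Engine contract described above.
> trait Engine[E, Q, R] {
>   // Validate the event, then either learn from it now (Kappa) or store it
>   // for a later train() (Lambda). Returns an error message if invalid.
>   def input(event: E): Either[String, Unit]
>
>   // Required: answer queries from the current model.
>   def query(q: Q): R
>
>   // Optional: batch (re)training; a no-op for pure online learners.
>   def train(): Unit = ()
> }
> ```
>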
> BTW the reason I call it a PredictionServer (in PIO) is because it is not
> an Engine Server; all it does is provide a query endpoint. This corresponds
> to only one method of an Engine, and there is no reason to look at a query
> endpoint any differently than the other public APIs of the Engine.
>
> I guess I look at this in an object-oriented way, not a data-oriented way.
> This leads to Template code/Engines making more decisions. The Kappa
> Template we have for this proto server never uses Spark; why would it need
> to, to implement Kappa online learning? It also does not need an Event
> Store because it only stores models. This is also fine for Lambda, where an
> Event Store is required, because the Engine provides the input method too
> and can make the store/no-store decision there.
>
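> For illustration, a toy Kappa-style online learner (purely hypothetical,
> not the actual Template): no Spark and no Event Store; each event updates
> the model in memory, and the model is persisted periodically.
>
> ```scala
> // Hypothetical sketch: an online popularity counter as a Kappa learner.
> class OnlinePopularity(persist: Map[String, Long] => Unit, every: Int = 1000) {
>   private var model = Map.empty[String, Long]
>   private var seen = 0L
>
>   // Each input event updates the model immediately; no events are stored.
>   def input(itemId: String): Unit = {
>     model = model.updated(itemId, model.getOrElse(itemId, 0L) + 1L)
>     seen += 1
>     if (seen % every == 0) persist(model) // periodic model checkpoint
>   }
>
>   // Query the current model in real time.
>   def query(n: Int): Seq[String] = model.toSeq.sortBy(-_._2).take(n).map(_._1)
> }
> ```
>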
> This has other benefits. Treating all input as an immutable stream has some
> major flaws. Some of the data has to be dropped; we cannot store it
> forever, since no one can afford that much disk. And some data can never be
> dropped, because only the aggregate of all object changes makes any sense.
> In reality the input comes in 2 types: persistent mutable objects and
> immutable streams of events (which may well be usable as a time window of
> data, dropping old events). With the above split, the mirror always has all
> input in case it's needed, and the Engine can decide which events operate
> on mutable objects and store the rest as a stream in the Event Store (with
> TTLs for time windows). Once this is trusted to work correctly, mirroring
> can be stopped. In fact the mutable objects can affect the model in real
> time now, even with Lambda Templates like the UR. When an object property
> changes in today's PIO, we have to wait until train before the model
> changes, because the Engine does not have an input method. If it did, then
> input that should affect the model could do so.
>
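> A minimal sketch of that split (names are placeholders, not the prototype's
> code): every raw event is mirrored first, then the Engine decides whether
> it mutates a persistent object or is appended to an immutable stream with
> a TTL.
>
> ```scala
> // Hypothetical sketch: route input into mutable objects vs. a TTL'd stream.
> case class Input(name: String, entityId: String, props: Map[String, String])
>
> class InputRouter(
>     mirror: Input => Unit,                                // keep all raw input
>     upsertObject: (String, Map[String, String]) => Unit,  // mutable objects
>     appendStream: (Input, Long) => Unit,                  // immutable events + TTL
>     streamTtlMs: Long = 90L * 24 * 3600 * 1000            // e.g. a 90-day window
> ) {
>   def receive(e: Input): Unit = {
>     mirror(e) // mirrored before any other decision
>     if (e.name == "$set") upsertObject(e.entityId, e.props)
>     else appendStream(e, streamTtlMs)
>   }
> }
> ```
>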
> This solves all my pet peeves, internal-API-wise, and allows one
> implementation of a SaaS-capable, multi-tenant, secure Server. And here
> multi-tenancy is super lightweight. Since most users have only one
> Template, they may have to install its supporting compute engines or
> stores, but this is a one-time issue for them, and Templates should come
> with containers and scripts to compose them. We're already doing this with
> PIO; a fully clustered install takes an hour. Admin of such a monster is
> another issue that is not necessarily better, or even good, in this model,
> but that's a subject for another day.
>
>
> On Jun 30, 2017, at 1:40 AM, Kenneth Chan <kenneth@apache.org> wrote:
>
> I agree that there is confusion regarding the Event Server vs. event
> storage, and that the usage definitions of the types of data storage (e.g.
> metadata vs. model) are unclear, but I'm not sure that bundling the Event
> Server with the Engine Server (or, as Pat calls it, the PredictionServer)
> is a good solution.
>
> Currently PIO has 3 "types" of storage:
> - METADATA: stores PIO's administrative data ("Apps", etc.)
> - EVENTDATA: stores the pure events
> - MODELDATA: stores the models
>
> 1. One confusion is that when the Universal Recommender is used,
> Elasticsearch is required in order to serve the predicted results. Should
> this type of storage be considered "MODELDATA" or "METADATA", or should we
> introduce a new type of storage for the "serving" purpose (which could be
> engine specific)?
>
>
> 2. A question regarding the problem described in
> https://issues.apache.org/jira/browse/PIO-96:
>
> ```
>  Problems emerge when a developer tries running multiple engines with
> different storage configs on the same underlying database, such as:
>
>    - a Classifier with *Postgres* meta, event, & model storage, and
>    - the Universal Recommender with *Elasticsearch* meta plus *Postgres* event
>    & model storage.
>
> ```
>
> Why would a user want to use a different storage config for each engine?
> Can the Classifier use the same configuration as the Universal Recommender?
> I thought the storage configuration was tied to PIO as a whole rather than
> to individual engines.
>
> Kenneth
>
>
>
>
> On Thu, Jun 29, 2017 at 10:22 AM, Pat Ferrel <pat@occamsmachete.com>
> wrote:
>
>> Are you asking about the EventServer or the PredictionServer? The
>> EventServer is multi-tenant with access keys, not really pure REST. We
>> (ActionML) did a hack to the PredictionServer for a client to allow Actors
>> to respond on the same port for several engines' queries. We used REST
>> addressing for this, which adds yet another id. This makes for one process
>> for the EventServer and one for the PredictionServer. Each responding
>> engine was behind an Actor, not a new process. So it's possible, but IMO
>> it makes the API as a whole rather messy. We also had to change the
>> workflow so metadata was read on `pio deploy`, so one build could then
>> deploy many times with different engine.jsons and different
>> PredictionServer endpoints for queries only. This comes pretty close to
>> clean multi-tenancy but is not SaaS capable without solving SSL and Auth
>> for both services.
>>
>> The hack was pretty ugly in the code, and after doing it I concluded that
>> a big chunk needed a rewrite, hence the prototype. It depends on what you
>> want, but if you want SaaS I think that means SSL + Auth + multi-tenancy,
>> and you also mention minimizing process boundaries. There are rather a lot
>> of implications to this.
>>
>> On Jun 29, 2017, at 9:57 AM, Mars Hall <mars@heroku.com> wrote:
>>
>> Donald, Pat, great to hear that this is a well-pondered design challenge
>> of PIO 😄 The prototype, composable, all-in-one server sounds promising.
>>
>> I'm wondering if there's a more immediate possibility to address adding
>> the `/events` REST API to an Engine. Would it make sense to try invoking
>> an `EventServiceActor` in the tools.commands.Engine#deploy method? If that
>> would be a distasteful hack, just say so. I'm trying to understand the
>> possibility of solving this in the current codebase vs. a visionary new
>> version of PIO.
>>
>> *Mars
>>
>> ( <> .. <> )
>>
>> > On Jun 28, 2017, at 18:01, Pat Ferrel <pat@occamsmachete.com> wrote:
>> >
>> > Ah, one of my favorite subjects.
>> >
>> > I'm working on a prototype server that handles online learning as well
>> as Lambda style. There is only one server, with everything going through
>> REST. There are 2 resource types, Engines and Commands. Engines have REST
>> APIs with endpoints for Events and Queries. So something like POST
>> /engines/resource-id/events would send an event to what is like a PIO app,
>> and POST /engines/resource-id/queries does the PIO query equivalent. Note
>> that this is fully multi-tenant and has only one important id. It's based
>> on akka-http in a fully microservice-type architecture. While the Server
>> is running you can add completely new Templates for any algorithm, thereby
>> adding new endpoints for Events and Queries. Each "tenant" is super
>> lightweight since it's just an Actor, not a new JVM. The CLI is actually
>> Python that hits the REST API with a Python SDK, and there is a Java SDK
>> too. We support SSL and OAuth2, so having those baked into an SDK is
>> really important. Though a prototype, it can support multi-tenant SaaS.
>> >
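>> For illustration, a sketch of those two per-engine endpoints using
>> akka-http's routing DSL (the handler functions are placeholders, not the
>> prototype's actual code):
>>
>> ```scala
>> // Hypothetical sketch: one server, one route tree, per-engine endpoints.
>> object EngineRoutes {
>>   import akka.http.scaladsl.server.Directives._
>>   import akka.http.scaladsl.server.Route
>>
>>   // Placeholder handlers; a real server would dispatch to an engine Actor.
>>   def ingest(engineId: String, body: String): String =
>>     s"""{"engine":"$engineId","status":"event received"}"""
>>   def respond(engineId: String, body: String): String =
>>     s"""{"engine":"$engineId","result":[]}"""
>>
>>   val route: Route =
>>     pathPrefix("engines" / Segment) { engineId =>
>>       path("events") {
>>         post(entity(as[String]) { body => complete(ingest(engineId, body)) })
>>       } ~
>>       path("queries") {
>>         post(entity(as[String]) { body => complete(respond(engineId, body)) })
>>       }
>>     }
>> }
>> ```
>>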
>> > We have a prototype online learner Template which does not save events
>> at all, though it ingests events exactly like PIO, in the same format; in
>> fact we have the same Template for both servers, taking identical input.
>> Instead of an EventServer it mirrors received events before validation
>> (yes, we have full event validation that is template specific). This
>> allows some events to affect mutable data in a database and some to just
>> be an immutable stream, or even be thrown away for Kappa learners. For an
>> online learner, each event updates the model, which is stored periodically
>> as a watermark. If you want to change algo params you destroy the engine
>> instance and replay the mirrored events. For a Lambda learner the events
>> may be stored as in PIO.
>> >
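>> As a minimal sketch of mirror-then-validate with replay (illustrative
>> names only, not the prototype's API): raw input is appended to a mirror
>> before validation, so an engine instance can be destroyed and rebuilt by
>> replaying the mirror with new algorithm parameters.
>>
>> ```scala
>> // Hypothetical sketch: mirror raw input, validate, learn, and replay.
>> class MirroringInput[E](parse: String => Either[String, E], learn: E => Unit) {
>>   private val mirror = scala.collection.mutable.ArrayBuffer.empty[String]
>>
>>   def receive(raw: String): Either[String, Unit] = {
>>     mirror += raw                 // mirrored even if validation fails
>>     parse(raw).map(learn)         // validate, then update the model
>>   }
>>
>>   def replay(): Unit =            // e.g. after changing algo params
>>     mirror.foreach(raw => parse(raw).foreach(learn))
>> }
>> ```
>>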
>> > This is very much along the lines of the proposal I put up for future
>> PIO, but the philosophy internally is so different that I'm now not sure
>> how it would fit. I'd love to talk about it sometime, and once we do a
>> Lambda Template we'll at least have some nice comparisons to make. We
>> migrated the Kappa-style Template to it, so we have a good idea that it's
>> not that hard. I'd love to donate it to PIO, but only if it makes sense.
>> >
>> >
>> > On Jun 28, 2017, at 4:27 PM, Donald Szeto <donald@apache.org> wrote:
>> >
>> > Hey Mars,
>> >
>> > Thanks for the suggestion, and I agree with your point on the metadata
>> part. Essentially I think the app and channel concepts should instead be
>> logically grouped together with events, not metadata.
>> >
>> > I think in some advanced use cases, event storage should not even be a
>> hard requirement, as engine templates can source data differently. In the
>> long run, it might be cleaner to have the event server (and all relevant
>> concepts such as its API, access keys, apps, etc.) as a separable package
>> that is turned on by default and embedded in the engine server. Advanced
>> users can either make it standalone or even turn it off completely.
>> >
>> > I imagine this kind of refactoring would echo Pat's proposal on making
>> a clean and separate engine and metadata management system down the road.
>> >
>> > Regards,
>> > Donald
>> >
>> > On Wed, Jun 28, 2017 at 3:29 PM Mars Hall <mars@heroku.com> wrote:
>> > One of the ongoing challenges we face with PredictionIO is the
>> separation of Engine & Eventserver APIs. This separation leads to several
>> problems:
>> >
>> > 1. Deploying a complete PredictionIO app requires multiple processes,
>> each with its own network listener
>> > 2. Eventserver & Engine must be configured to share exactly the same
>> storage backends (same `pio-env.sh`)
>> > 3. Confusion between "Eventserver" (an optional REST API) & "event
>> storage" (a required database)
>> >
>> > These challenges are exacerbated by the fact that PredictionIO's docs &
>> `pio app` CLI make it appear that sharing an Eventserver between Engines
>> is a good idea. I recently filed a JIRA issue about this topic. TL;DR:
>> sharing an Eventserver between Engines with different Meta Storage configs
>> will cause data corruption:
>> >  https://issues.apache.org/jira/browse/PIO-96
>> >
>> >
>> > I believe a lot of these issues could be alleviated with one change to
>> PredictionIO core:
>> >
>> > By default, expose the Eventserver API from the `pio deploy` Engine
>> process, so that it is not necessary to deploy a second Eventserver-only
>> process. A separate `pio eventserver` would still be an option if you need
>> the separation of concerns for scalability.
>> >
>> >
>> > I'd love to hear what you folks think. I will file a JIRA enhancement
>> issue if this seems like an acceptable approach.
>> >
>> > *Mars Hall
>> > Customer Facing Architect
>> > Salesforce Platform / Heroku
>> > San Francisco, California
>> >
>> >
>>
>>
>>
>
>
