predictionio-user mailing list archives

From Pat Ferrel <...@occamsmachete.com>
Subject Re: Eventserver API in an Engine?
Date Sun, 09 Jul 2017 23:45:45 GMT
Mostly agree with this, but Kappa-style online learners do not need dataset storage; they do
need realtime input.


On Jul 8, 2017, at 12:31 AM, Kenneth Chan <kenneth@apache.org> wrote:

# re: " I see it as objects you see it as data stores"

Not really. I see things based on what functionality and purpose they provide. Like you mentioned,
the way Elasticsearch is used in UR makes it part of the model: it is where the algorithm writes
the computation result, which is then used for serving. In a way, it's the model, just a more
complex model than a simple linear regression function.
If we define "Model" as the output of the train() function, then UR is storing the model in
Elasticsearch, and that is required because UR relies on Elasticsearch computation, meaning
it's part of UR's "model".predict()


# re:  "In reality the input comes in 2 types, persistent mutable objects and immutable streams
of events (that may well be usable as a time window of data, dropping old events)"

Like you said, there are basically two types of data:
1. mutable objects (e.g. metadata of a product, a user profile, etc.)
2. immutable events (e.g. behavior data)

However, 1 can be considered as 2 if we treat the "changes" of a mutable object as "events" as
well. This is basically the current event server design.

But I agree some use cases may not care about changes of a mutable object. For these, we can
provide some API/option for people to store mutable objects and always overwrite, or use a better
storage structure to capture the changes of a mutable object.
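
As a rough illustration of treating changes to a mutable object as events, the sketch below
replays change events into the current state of one object. It assumes PredictionIO's reserved
`$set`, `$unset`, and `$delete` event names and ISO 8601 `eventTime` strings that all share one
format and timezone. The function name `fold_properties` is hypothetical and not part of PIO.

```python
from typing import Any, Dict, Iterable, Optional


def fold_properties(events: Iterable[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Replay change events, oldest first, into the current state of one object.

    Returns None if the object was never set or was deleted last.
    """
    state: Optional[Dict[str, Any]] = None
    # Assumption: eventTime strings share one format, so they sort chronologically.
    for ev in sorted(events, key=lambda e: e["eventTime"]):
        name = ev["event"]
        props = ev.get("properties", {})
        if name == "$set":
            # Later values overwrite earlier ones, key by key.
            state = {**(state or {}), **props}
        elif name == "$unset" and state is not None:
            for key in props:
                state.pop(key, None)
        elif name == "$delete":
            state = None
    return state
```

The "always overwrite" option mentioned above would keep only the result of this fold and
discard the individual change events.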

On Fri, Jun 30, 2017 at 5:29 AM, Pat Ferrel <pat@occamsmachete.com> wrote:
Actually I think it’s a great solution. The question about different storage configs
(https://issues.apache.org/jira/browse/PIO-96) arises because Elasticsearch performs the
last step of the algorithm. It is not just a store for models, so it’s an integral part
of the compute engine, not the storage. If it looks that way I hardly think it matters in
the way implied (see below where Templates should come with composable containers). This
is actually the primary difference in the way you and I look at the problem: I see it as
objects, you see it as data stores. Let’s add the question of compute backends, and unfortunately
users will have to pick the solution along with the engines they require (TensorFlow anyone?).
If PIO is going to be a viable ML/AI server in the long term it has to be a lot more flexible,
not less so. In the proto server I mention, the Engine decides on the compute backend and
the example Template does not use Spark.

The prototype server I mentioned actually only handles metadata, installs engines, and mirrors
input. To handle Kappa as well as Lambda algorithms the Engine must decide whether and what it
needs to store. Therefore instead of assuming an EventServer we have mirroring of un-validated
events. This has many benefits. For one thing we can require validation from the Engine with
every event. This matters because the single most frequent mistake by users I’ve dealt with is
malformed input. PIO’s input scheme is great because it is so flexible, but because of that
validation is nil. I have seen users that have been using a Template for a year without understanding
that most of their data was ignored by the Template code (not the UR in this case). I have
spent literally thousands of hours helping correct bad input over email even though the UR
has orders of magnitude better docs than any other Template. Yes, it’s also a lot more complicated,
but anyway, I’m tired of this. We need validation of every input. Then maybe I will only
spend 90% of those hours :-P

Anyway I think the separation of concerns should be: the Server handles metadata, installs engines,
and mirrors input. The Template framework provides required APIs that Engines must
implement and a set of Tools they can use or ignore, so they can use whatever they need. If the
Engine provides an input method it can validate and, if it is Kappa, learn immediately (update
models in real time); if it is Lambda, it stores the valid data using something like an Event
Store. The train method is then optional and, of course, there is query.
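
The split described above could be sketched roughly as follows. This is only an illustration
under stated assumptions, not the prototype's actual API: the class names (`Engine`,
`KappaCounter`, `LambdaCounter`), the required field list, and the toy "count events per
entity" model are all hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

Event = Dict[str, Any]


class ValidationError(ValueError):
    """Raised when an incoming event is malformed."""


class Engine(ABC):
    """Hypothetical Engine contract: input and query are required, train is optional."""

    # Assumed minimal schema; a real Template would validate much more.
    REQUIRED_FIELDS = ("event", "entityType", "entityId")

    def validate(self, event: Event) -> None:
        missing = [f for f in self.REQUIRED_FIELDS if f not in event]
        if missing:
            raise ValidationError(f"missing fields: {missing}")

    @abstractmethod
    def input(self, event: Event) -> None:
        ...

    @abstractmethod
    def query(self, q: Dict[str, Any]) -> int:
        ...

    def train(self) -> None:
        """Optional: Kappa engines can leave this as a no-op."""


class KappaCounter(Engine):
    """Online learner: every valid event updates the model, nothing else is stored."""

    def __init__(self) -> None:
        self.model: Dict[str, int] = {}

    def input(self, event: Event) -> None:
        self.validate(event)
        key = event["entityId"]
        self.model[key] = self.model.get(key, 0) + 1

    def query(self, q: Dict[str, Any]) -> int:
        return self.model.get(q["entityId"], 0)


class LambdaCounter(Engine):
    """Batch learner: valid events are stored, the model changes only on train()."""

    def __init__(self) -> None:
        self.event_store: List[Event] = []
        self.model: Dict[str, int] = {}

    def input(self, event: Event) -> None:
        self.validate(event)
        self.event_store.append(event)

    def train(self) -> None:
        model: Dict[str, int] = {}
        for event in self.event_store:
            key = event["entityId"]
            model[key] = model.get(key, 0) + 1
        self.model = model

    def query(self, q: Dict[str, Any]) -> int:
        return self.model.get(q["entityId"], 0)
```

Both engines reject malformed input at the same place, so the store/no-store decision stays
inside the Engine rather than in a separate EventServer.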

BTW the reason I call it a PredictionServer (in PIO) is because it is not an Engine Server,
all it does is provide a query endpoint. This corresponds to only one method of an Engine
and there is no reason to look at a query endpoint any differently than the other public APIs
of the Engine.

I guess I look at this in an object oriented way, not a data oriented way. This leads to Template
code/Engines making more decisions. The Kappa template we have for this proto server never
uses Spark. Why would it need Spark to implement Kappa online learning? It also does not need an
Event Store because it only stores models. This is also fine for Lambda, where an Event Store is
required, because the Engine provides the input method too, where it can make the store/no-store
decision.

This has other benefits. Treating input as an immutable stream has some major flaws. Some
of the data has to be dropped; we cannot store forever because no one can afford that much disk.
And some data can never be dropped because only the aggregate of all object changes makes
any sense. In reality the input comes in 2 types, persistent mutable objects and immutable
streams of events (that may well be usable as a time window of data, dropping old events).
With the above split, the mirror always has all input in case it’s needed, and the Engine can
decide which events operate on mutable objects and store the rest as a stream in the Event
Store (with TTL for time windows). Once this is trusted to work correctly mirroring can be
stopped. In fact the mutable objects can affect the model in real time now, even with Lambda
Templates like the UR. When an object property changes in today’s PIO we have to wait till
train before the model changes because the Engine does not have an input method. If it did,
then input that should affect the model could do so immediately.
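
A minimal sketch of that input path, assuming events arrive in timestamp order: raw input is
mirrored first, `$set` events update a mutable object, and everything else goes into a stream
with a TTL. The names `WindowedEventStore` and `handle_input` are hypothetical, and the
in-memory containers stand in for whatever database or Event Store an Engine would really use.

```python
from collections import deque
from typing import Any, Deque, Dict, List, Tuple

Event = Dict[str, Any]


class WindowedEventStore:
    """Immutable event stream kept only for a time window (TTL in seconds)."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl_seconds = ttl_seconds
        self._events: Deque[Tuple[float, Event]] = deque()  # oldest first

    def append(self, timestamp: float, event: Event) -> None:
        self._events.append((timestamp, event))

    def expire(self, now: float) -> int:
        """Drop events older than the TTL and return how many were dropped."""
        dropped = 0
        while self._events and now - self._events[0][0] > self.ttl_seconds:
            self._events.popleft()
            dropped += 1
        return dropped

    def events(self) -> List[Event]:
        return [event for _, event in self._events]


def handle_input(
    timestamp: float,
    event: Event,
    mirror: List[Event],
    objects: Dict[str, Dict[str, Any]],
    stream: WindowedEventStore,
) -> None:
    # The mirror records raw input before any inspection or validation.
    mirror.append(event)
    if event["event"] == "$set":
        # Mutable object: merge the changed properties into the stored object.
        objects.setdefault(event["entityId"], {}).update(event["properties"])
    else:
        # Immutable behavior event: kept only for the time window.
        stream.append(timestamp, event)
```

Because the mirror keeps everything, the object store and the stream can be rebuilt by
replaying it, which is what makes it safe to drop old stream events.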

This solves all my pet peeves, internal API-wise, and allows one implementation of a SaaS-capable,
multi-tenant, secure Server. And here multi-tenancy is super lightweight. Since most
users have only one Template, they may have to install supporting compute engines or stores.
This is a one time issue for them and Templates should come with containers and scripts to
compose them. We’re already doing this with PIO. A fully clustered install takes an hour.
Admin of such a monster is another issue that is not necessarily better or even good in this
model but a subject for another day.


On Jun 30, 2017, at 1:40 AM, Kenneth Chan <kenneth@apache.org> wrote:

I agree that there is confusion regarding event server vs. event storage, and an unclear
usage definition of the types of data storage (e.g. metadata vs. model),
but I'm not sure that bundling the Event Server with the Engine Server (or PredictionServer, as
Pat calls it) is a good solution.

currently PIO has 3 "types" of storage
- METADATA  : store PIO's administrative data ("Apps", etc)
- EVENTDATA: store the pure events
- MODELDATA : store the model

1. One confusion is that when the Universal Recommender is used, Elasticsearch is required in order
to serve the predicted results. Is this type of storage considered "MODELDATA" or "METADATA",
or should we introduce a new type of storage for "Serving" purposes (which can be
engine-specific)?


2. A question regarding the problem described in ticket https://issues.apache.org/jira/browse/PIO-96

```
 Problems emerge when a developer tries running multiple engines with different storage configs
on the same underlying database, such as:
a Classifier with Postgres meta, event, & model storage, and
the Universal Recommender with Elasticsearch meta plus Postgres event & model storage.
```

Why would a user want to use different storage configs for different engines? Can the classifier
match the same configuration as the Universal Recommender?
I thought the storage configuration was tied more to PIO as a whole than to each
engine.

Kenneth




On Thu, Jun 29, 2017 at 10:22 AM, Pat Ferrel <pat@occamsmachete.com> wrote:
Are you asking about the EventServer or PredictionServer? The EventServer is multi-tenant
with access keys, not really pure REST. We (ActionML) did a hack to the PredictionServer for a
client to allow Actors to respond on the same port for several engine queries. We used REST addressing
for this, which adds yet another id. This makes for one process for the EventServer and one
for the PredictionServer. Each responding engine was behind an Actor, not a new process. So
it’s possible but IMO makes the API as a whole rather messy. We also had to change the workflow
so metadata was read on `pio deploy` so one build could then deploy many times with different
engine.jsons and different PredictionServer endpoints for queries only. This comes pretty
close to clean multi-tenancy but is not SaaS capable without solving SSL and Auth for both
services.

The hack was pretty ugly in the code and after doing that I concluded that a big chunk needed
a rewrite and hence the prototype. It depends on what you want but if you want SaaS I think
that means SSL + Auth + multi-tenancy, and you also mention minimizing process boundaries.
There are rather many implications to this.

On Jun 29, 2017, at 9:57 AM, Mars Hall <mars@heroku.com> wrote:

Donald, Pat, great to hear that this is a well-pondered design challenge of PIO 😄 The prototype,
composable, all-in-one server sounds promising.

I'm wondering if there's a more immediate possibility to address adding the `/events` REST
API to Engine? Would it make sense to try invoking an `EventServiceActor` in the tools.commands.Engine#deploy
method? If that would be a distasteful hack, just say so. I'm trying to understand the possibility
of solving this in the current codebase vs a visionary new version of PIO.

*Mars

( <> .. <> )

> On Jun 28, 2017, at 18:01, Pat Ferrel <pat@occamsmachete.com> wrote:
>
> Ah, one of my favorite subjects.
>
> I’m working on a prototype server that handles online learning as well as Lambda style.
There is only one server with everything going through REST. There are 2 resource types, Engines
and Commands. Engines have REST APIs with endpoints for Events and Queries. So something like
POST /engines/resource-id/events would send an event to what is like a PIO app and POST /engines/resource-id/queries
does the PIO query equivalent. Note that this is fully multi-tenant and has only one important
id. It’s based on akka-http in a fully microservice type architecture. While the Server
is running you can add completely new Templates for any algorithm, thereby adding new endpoints
for Events and Queries. Each “tenant” is super lightweight since it’s just an Actor
not a new JVM. The CLI is actually Python that hits the REST API with a Python SDK, and there
is a Java SDK too. We support SSL and OAuth2 so having those baked into an SDK is really important.
Though a prototype it can support multi-tenant SaaS.
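
To make the two endpoints concrete, here is a small client-side sketch that only builds the
POST requests and does not send them. It assumes both resources live under `/engines/` and
that bodies are JSON; the helper name `engine_request`, the base URL, and the engine id are
hypothetical, and SSL and OAuth2 handling are left out entirely.

```python
import json
from typing import Any, Dict
from urllib.request import Request


def engine_request(base_url: str, engine_id: str, resource: str,
                   payload: Dict[str, Any]) -> Request:
    """Build a POST request for /engines/<engine_id>/events or /queries."""
    if resource not in ("events", "queries"):
        raise ValueError(f"unknown resource: {resource}")
    url = f"{base_url.rstrip('/')}/engines/{engine_id}/{resource}"
    body = json.dumps(payload).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical server address and engine id, for illustration only.
event_req = engine_request(
    "https://localhost:9090",
    "my-engine",
    "events",
    {"event": "view", "entityType": "user", "entityId": "u1"},
)
query_req = engine_request("https://localhost:9090", "my-engine", "queries", {"user": "u1"})
```

Passing either request to `urllib.request.urlopen` would send it, given a server that is
actually listening at that address.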
>
> We have a prototype online learner Template which does not save events at all, though
it ingests events exactly like PIO, in the same format. In fact we have the same template for
both servers taking identical input. Instead of an EventServer it mirrors received
events before validation (yes, we have full event validation that is template specific). This
allows some events to affect mutable data in a database and some to just be an immutable stream
or even be thrown away for Kappa learners. For an online learner, each event updates the model,
which is stored periodically as a watermark. If you want to change algo params you destroy
the engine instance and replay the mirrored events. For a Lambda learner the Events may be
stored like PIO.
>
> This is very much along the lines of the proposal I put up for future PIO but the philosophy
internally is so different that I’m now not sure how it would fit. I’d love to talk about
it sometime and once we do a Lambda Template we’ll at least have some nice comparisons to
make. We migrated the Kappa style Template to it so we have a good idea that it’s not that
hard. I’d love to donate it to PIO but only if it makes sense.
>
>
> On Jun 28, 2017, at 4:27 PM, Donald Szeto <donald@apache.org> wrote:
>
> Hey Mars,
>
> Thanks for the suggestion and I agree with your point on the metadata part. Essentially
I think the app and channel concept should instead be logically grouped together with event,
not metadata.
>
> I think in some advanced use cases, event storage should not even be a hard requirement
as engine templates can source data differently. In the long run, it might be cleaner to have
event server (and all relevant concepts such as its API, access keys, apps, etc) as a separable
package that is turned on by default and embedded in the engine server. Advanced users can either
make it standalone or even turn it off completely.
>
> I imagine this kind of refactoring would echo Pat's proposal on making a clean and separate
engine and metadata management system down the road.
>
> Regards,
> Donald
>
> On Wed, Jun 28, 2017 at 3:29 PM Mars Hall <mars@heroku.com> wrote:
> One of the ongoing challenges we face with PredictionIO is the separation of Engine &
Eventserver APIs. This separation leads to several problems:
>
> 1. Deploying a complete PredictionIO app requires multiple processes, each with its own
network listener
> 2. Eventserver & Engine must be configured to share exactly the same storage backends
(same `pio-env.sh`)
> 3. Confusion between "Eventserver" (an optional REST API) & "event storage" (a required
database)
>
> These challenges are exacerbated by the fact that PredictionIO's docs & `pio app`
CLI make it appear that sharing an Eventserver between Engines is a good idea. I recently
filed a JIRA issue about this topic. TL;DR sharing an eventserver between engines with different
Meta Storage config will cause data corruption:
>  https://issues.apache.org/jira/browse/PIO-96
>
>
> I believe a lot of these issues could be alleviated with one change to PredictionIO core:
>
> By default, expose the Eventserver API from the `pio deploy` Engine process, so that
it is not necessary to deploy a second Eventserver-only process. Separate `pio eventserver`
could still be optional if you need the separation of concerns for scalability.
>
>
> I'd love to hear what you folks think. I will file a JIRA enhancement issue if this seems
like an acceptable approach.
>
> *Mars Hall
> Customer Facing Architect
> Salesforce Platform / Heroku
> San Francisco, California
>
>






