predictionio-user mailing list archives

From Pat Ferrel <...@occamsmachete.com>
Subject Re: Eventserver API in an Engine?
Date Wed, 12 Jul 2017 22:18:27 GMT
The prototype has multiple engines of any type that can accept any events they understand.
There is no such “illusion” AFAICT. If a set of templates accepts the same events, this works
just fine. As I said, “they may even share validation code,” and the new input method can be
stubbed to do no validation and just send raw events to the EventStore, so we don’t even
complicate the APIs unless the Template wants the hook. Given the same stream of events you
could even have one Lambda Engine and one Kappa Engine (PIO Contextual Bandit events = Harness
Contextual Bandit events).
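
Something like this is all the stub would amount to (a purely hypothetical sketch; the names are
illustrative and are not actual Harness or PIO code):

# Hypothetical sketch of the stubbed input hook described above: by default an
# engine validates nothing and simply forwards raw events to the shared
# EventStore; a Template overrides validate() only if it wants the hook.
class EngineInput:
    def __init__(self, event_store):
        self.event_store = event_store  # shared store, so events stay reusable by other engines

    def validate(self, raw_event):
        # Default stub: accept every event unchanged.
        return raw_event

    def receive(self, raw_event):
        event = self.validate(raw_event)
        if event is not None:
            self.event_store.append(event)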

Unifying ES and PS *with* multi-tenancy for both allows anything from SaaS-able multi-client
deployment to single engine deployment or anything in-between in a rather elegant way IMO.

The only thing you lose is the ability to have a dataset inside the EventStore shared between
2 engines. The event stream is the sharable thing. I’m sure we could solve this, but I doubt
it would be of real use, and it would be dangerous, having potentially complex side effects.


On Jul 12, 2017, at 1:34 PM, Kenneth Chan <kenneth@apache.org> wrote:

I don't think it's about turnkey or not. I think it's about whether PIO is for a single-engine
vertical only or for multiple engines sharing data.
For example, the UR accepts multiple events, like "eventNames": ["buy", "view"]. One can create
another classification engine to use the same set of events.
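
e.g. with the Python SDK (a minimal sketch; the access key and event server URL are placeholders),
the application sends its "buy" and "view" events once into a single app, and any engine that
names that app can train on them:

import predictionio  # official PredictionIO Python SDK

# One app, one access key; every engine configured for this app reads the same events.
events = predictionio.EventClient(
    access_key="YOUR_APP_ACCESS_KEY",   # placeholder
    url="http://localhost:7070",        # placeholder event server URL
)

events.create_event(
    event="buy",
    entity_type="user",
    entity_id="u1",
    target_entity_type="item",
    target_entity_id="i42",
)
events.create_event(
    event="view",
    entity_type="user",
    entity_id="u1",
    target_entity_type="item",
    target_entity_id="i99",
)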

I understand there are differences in the complexity of templates, but we can't say PIO can't
run multiple engines sharing data just because existing templates don't work together - the
templates are meant to show how to do things differently - re: "this is demonstrably untrue.
Try it. Clustering for some templates assumes textual data, others do not. This seems so far
from my experience that your statement is baffling. The PIO event stream from one recommender
to the next is not compatible either. The E-Com engine requires $set events on items, the
UR does not. So taking UR events into the E-Com recommender would result in garbage output."
 

My concern about embedding the event server in the engine is:
- What problem are we solving by providing the illusion that events are limited to only one
engine?





On Wed, Jul 12, 2017 at 12:11 PM, Pat Ferrel <pat@occamsmachete.com> wrote:
"I think to resolve Mars immediate need, we can implement embedded event server in a couple
phases. Roughly it would be wiring the existing event server in (with some refactoring) and
mark it experimental, then continue toward a clean, app-specific event server.”

This sounds reasonable but Donald, you may have a better in-house understanding of Mars’s
requirements. I would love to see a simple list.

I also like the parallel “experimental” track, which allows us to do refactoring with little
disruption to existing users. This is the way Mahout went from a loose collection of Hadoop
MapReduce algorithms to a completely new codebase that is a platform-neutral (but primarily
Spark-based), general, optimized, massively scalable linear algebra solving engine.

I'll offer to donate what we are calling Harness (previously pio-kappa). It is prototype code
but implements most of AML's design goals as mentioned in this and other threads. Its
implementation is fully functional for a single Kappa-style engine, so Lambda support is
only stubbed out. Integration of the PIO EventStore is not done—data storage is not abstracted
yet. It implements app-centric Templates, but in a fully multi-tenant, secure manner. The most
solid part is the microservice-based multi-tenant REST server with an accompanying Python CLI,
along with Java and Python SDKs. Not sure if it applies to the short-term needs Mars has.

If you read these docs, do not come with PIO workflow preconceptions: https://github.com/actionml/harness


On Jul 12, 2017, at 9:53 AM, Donald Szeto <donald@apache.org> wrote:

Many good discussions. Let me provide my input on these issues.

Multiple installations of PredictionIO should use different database names. An analogy would
be WordPress installations, each of which expects its own metadata database. I understand the
downside to this is that some users only have access to one database. We can add database table
prefixing support to alleviate this, as most other projects do. I agree it is not very clear in
the documentation that installations of PIO should not be backed by overlapping data stores.

Regarding the discussion of data and engines, here's how it seems to me: there are two
directions of data science development.

One perspective is that data collection and processing are independent of data science
development. Data are collected and organized ("apps" in PIO terms). Developers go look at
what's available, explore, and develop (engines).

The other is to provide turnkey solutions. Well-crafted engines expect certain inputs and
expose knobs for tuning.

PIO supports both styles today. Apps provide the grouping of data, and an engine is the
abstraction that defines its concern with the data. These have been well defined from day one.

Side track: a confusion I feel here is that templates have different degrees of sophistication.
The universal recommender is definitely much more sophisticated and turnkey than the skeleton
template, for example. We should label this in our template gallery.

Going back to Mars' suggestion. If the use case is such that the engine server also collects
data used only by that engine, it feels like the right abstraction would be embedding a subset
of the event server that collects data going to a single app. Recall that the app name is
configured in engine.json.
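
For instance, a minimal sketch of what the engine already knows (most current templates keep
the app name under "datasource"/"params"/"appName", while older ones use appId, so adjust for
your own engine.json):

import json

# The engine already declares the single app whose events it consumes,
# so an embedded Events API could scope itself to exactly that app.
with open("engine.json") as f:
    engine_conf = json.load(f)

app_name = engine_conf["datasource"]["params"]["appName"]
print("This engine collects and consumes events for app:", app_name)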

I think to resolve Mars' immediate need, we can implement an embedded event server in a couple
of phases. Roughly, it would be wiring the existing event server in (with some refactoring) and
marking it experimental, then continuing toward a clean, app-specific event server.

Let me know how these sound.

On Tue, Jul 11, 2017 at 1:39 PM Kenneth Chan <kenneth@apache.org> wrote:
re: 
"
when deploying multiple engines with different versions of PIO and different storage configurations
....

needing separate PIO installs regularly when testing the next release or development builds
of PIO and when evaluating engine templates or algorithms that require new, different storage
configs. Also, those in the consulting world are frequently required to keep client data separated
for all kinds of privacy & legal reasons; with the storage corruption bug I reported,
one client's data could become visible to or intermingled with another client's app.
"

When installing multiple PIOs separately, could you set each PIO database config to use
different table names so they don't conflict?
Or bring up another VM to isolate each PIO?

Donald, do you have a best practice or advice if a user wants to install multiple PIO versions
and be able to run them on the same machine?



On Tue, Jul 11, 2017 at 12:49 PM, Kenneth Chan <kenneth@apache.org> wrote:
I think we are under the wrong impression that every template is supposed to work together out
of the box.

The templates are meant to be examples and demonstrations - that's why they are called
templates! They are never meant to fit into any user application right away. Each application
has its own uniqueness. A template only assumes a specific use case for demonstration purposes.

Users can start with a template for the simple case, but they need to modify it for their final needs.

For example, the PIO classification template is only meant to demonstrate simple classification.
In the end, how to use classification is application-specific. For example, one can modify the
classification template to train a classifier on the same set of data used by a recommender.
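
e.g. (a rough sketch with the Python SDK; the URLs and query fields are illustrative and depend
on how each engine is deployed and what its template expects):

import predictionio  # PredictionIO Python SDK

# Two engines deployed separately, both trained on the same app's events.
recommender = predictionio.EngineClient(url="http://localhost:8000")
classifier = predictionio.EngineClient(url="http://localhost:8001")

print(recommender.send_query({"user": "u1", "num": 4}))
print(classifier.send_query({"attr0": 2, "attr1": 0, "attr2": 1}))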




On Tue, Jul 11, 2017 at 10:31 AM, Pat Ferrel <pat@occamsmachete.com> wrote:
Understood, you have immediate practical reasons for one integrated deployment with the two
endpoints. But Apache is a do-ocracy, meaning those who do something win the argument as long
as they have enough consensus. I have enough experience with PIO that I have chosen to fix a
lot of issues with the prototype design, having already gone down the “quick hack” path once.
You may want to do something else if you have the resources.

I fear that my deeper changes will not get enough consensus and we may end up with a competing
ML/AI server framework some day. That is another ASF tendency. Innovations happen before going
into ASF, often not under ASF rules.

In any case—how much of your problem is workflow vs installation vs bundling of APIs? Can
you explain it more?


On Jul 11, 2017, at 9:37 AM, Mars Hall <mars@heroku.com> wrote:

> On Jul 10, 2017, at 18:03, Kenneth Chan <kenneth@apache.org> wrote:
>
> it's all the same set of events collected for my application, and I can create multiple
engines to use these data for different purposes.


Clear to me, ⬆️ this is the prevailing reasoning behind the "separateness" of the Eventserver.
I do not forsake this design goal, but I ask that we consider the usability & durability of
PredictionIO when deploying multiple engines with different versions of PIO and different
storage configurations. This will probably happen for anyone who uses PredictionIO long-term
in production, as their new projects come online with newer & better versions & configurations.

I encounter this situation of needing separate PIO installs regularly when testing the next
release or development builds of PIO and when evaluating engine templates or algorithms that
require new, different storage configs. Also, those in the consulting world are frequently
required to keep client data separated for all kinds of privacy & legal reasons; with
the storage corruption bug I reported, one client's data could become visible to or intermingled
with another client's app.

In starting this thread, I was hoping to find some traction with the idea of making it possible
to completely self-contain a PredictionIO app by adding the Events API to the process started
with `pio deploy`.

Goal: Queries & Events APIs in the same process.
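
Today that means two processes and two base URLs for a client (a sketch with the Python SDK;
the access key and URLs are placeholders):

import predictionio  # PredictionIO Python SDK

# Today: two separate processes on two separate ports.
events = predictionio.EventClient(access_key="APP_ACCESS_KEY",   # placeholder
                                  url="http://localhost:7070")   # event server process
queries = predictionio.EngineClient(url="http://localhost:8000") # `pio deploy` process

# The goal above: both clients pointed at the single process started by `pio deploy`.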

When considering the architecture of apps, sharing a database between two or more apps is
considered a very naughty way to get around having clear, clean inter-process APIs. My team
at Salesforce/Heroku has been struck by this exact issue with PredictionIO. So, I am seeking
a way to fix this without requiring a rewrite of PredictionIO. I am excited to hear about the
new architecture prototypes, yet our reality is that this is an issue now.

*Mars

( <> .. <> )







