predictionio-user mailing list archives

From Brian Chiu <br...@snaptee.co>
Subject Re: How to training and deploy on different machine?
Date Thu, 21 Sep 2017 03:30:22 GMT
Dear Pat,

Thanks for the detailed guide.  It is nice to know this is possible.
But I am not sure I understand it correctly, so could you please
point out any misunderstandings in the following, if there are any?

====
Let's say I have 3 machines.

There is one machine [EventServer and data store] for ES, HBase+HDFS
(or Postgres, but that is not recommended).
The other 2 machines will both connect to this machine.
It is permanent.

machine [TrainingServer] will run `pio build` and `pio train`.
This step pulls training data from [EventServer] and then stores the
model and metadata back.
It is not permanent.

machine [PredictionServer] gets a copy of the template directory from
machine [TrainingServer] (this only needs to be done once),
then runs `pio deploy`.
It is not a Spark driver or executor during training.
Write a cron job for `pio deploy`.
It is permanent.
====
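
If that is right, the commands on each machine would be roughly the
following (just a sketch; the install path is my guess, not anything
you specified, and it must be identical on all machines):

```shell
# [EventServer] machine -- permanent; Elasticsearch + HBase/HDFS behind it
pio eventserver &

# [TrainingServer] machine -- temporary; same pio install and storage config
cd /opt/universal-recommender    # hypothetical path; must match on all machines
pio build
pio train                        # reads events from [EventServer], writes model back

# [PredictionServer] machine -- permanent; one-time copy of the template, then deploy
scp -r trainingserver:/opt/universal-recommender /opt/
cd /opt/universal-recommender
pio deploy                       # serves queries on port 8000 by default
```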

Thanks

Brian

On Wed, Sep 20, 2017 at 11:16 PM, Pat Ferrel <pat@occamsmachete.com> wrote:
> Yes, this is the recommended config (Postgres is not, but more on that later).
> Spark is only needed during training, but the `pio train` process creates a
> driver and executors in Spark. The driver will be the `pio train` machine, so
> you must install pio on it. You should have at least 2 Spark machines because
> the driver and executors need roughly the same memory; more executors will
> train faster.
>
> You will have to spread the pio “workflow” out over a permanent
> deploy+eventserver machine. I usually call this a combo PredictionServer and
> EventServer. These are 2 JVM processes that take events and respond to queries,
> and so must be available all the time. You will run `pio eventserver` and
> `pio deploy` on this machine; the Spark driver machine will run `pio train`.
> Since no state is stored in PIO, this works because the machines get their
> state from the DBs (HBase is recommended, plus Elasticsearch). Install pio
> and the UR in the same location on all machines, because the path to the UR
> is used by PIO to give an id to the engine (not ideal, but oh well).
>
> Once set up:
>
> 1. Run `pio eventserver` on the permanent PS/ES machine and input your data
> into the EventServer.
> 2. Run `pio build` and then `pio train` on the “driver” machine. This builds
> the UR, puts metadata about the instance in PIO, and creates the Spark
> driver, which can use a separate machine or 3 as Spark executors.
> 3. Copy the UR directory to the PS/ES machine and run `pio deploy` from the
> copied directory.
> 4. Shut down the driver machine and Spark executors. On AWS, “stopping” them
> means the config is saved, so you only pay for EBS storage. You will start
> them again before the next train.
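>
> The AWS stop/start mentioned above might look like this (the instance ids
> are placeholders; `stop-instances` keeps the instance and its EBS volumes,
> unlike `terminate-instances`):
>
> ```shell
> # stop the Spark machines after training; you pay only for EBS storage
> aws ec2 stop-instances --instance-ids i-0aaaaexample i-0bbbbexample
>
> # start them again just before the next `pio train`
> aws ec2 start-instances --instance-ids i-0aaaaexample i-0bbbbexample
> ```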
>
>
> From then on there is no need to copy the UR directory; just spin up the
> driver and any other Spark machines, run `pio train`, and you are done. The
> new model is automatically hot-swapped with the old one, with no downtime
> and no need to re-deploy.
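>
> A recurring retrain could then be a cron entry on the driver machine,
> something like this (the path, user, and schedule are made up; starting
> and stopping the Spark machines would wrap around it):
>
> ```shell
> # /etc/cron.d/ur-train -- hypothetical weekly retrain, Sundays at 03:00;
> # the running `pio deploy` hot-swaps the new model automatically
> 0 3 * * 0  pio  cd /opt/universal-recommender && pio train
> ```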
>
> This only works in this order if you want to take advantage of a temporary
> Spark cluster. PIO is installed on the PS/ES machine and the “driver”
> machine in exactly the same way, connecting to the same stores.
>
> Hmm, I should write a How to for this...
>
>
>
> On Sep 20, 2017, at 3:23 AM, Brian Chiu <brian@snaptee.co> wrote:
>
> Hi,
>
> I would like to be able to train and run the model on different machines.
> The reason is that, on my dataset, training takes around 16GB of memory
> while deploying only needs 8GB. To save money, it would be better to use
> only an 8GB machine in production and start a 16GB one perhaps weekly for
> training. Is this possible with PredictionIO + the Universal Recommender?
>
> I have done some searching and found a related guide here:
> https://github.com/actionml/docs.actionml.com/blob/master/pio_load_balancing.md
> which copies the whole template directory and then runs `pio deploy`. But
> in their case HBase and Elasticsearch clusters are used, while in my case
> only a single machine runs Elasticsearch and PostgreSQL. Will this work?
> (I am flexible about using PostgreSQL, localfs, or HBase, but I cannot
> afford a cluster.)
>
> Perhaps another solution is to make the 16GB machine a Spark slave, start
> it before training starts, and have the 8GB machine connect to it; then run
> `pio train; pio deploy` on the 8GB machine and finally shut down the 16GB
> machine. But I have no idea whether this can work, and if it can, is there
> any documentation I can look into?
>
> Any other method is welcome!  Zero downtime is preferred but not necessary.
>
> Thanks in advance.
>
>
> Best Regards,
> Brian
>
