predictionio-user mailing list archives

From Pat Ferrel <>
Subject Re: Data import, HBase requirements, and cost savings ?
Date Tue, 10 Apr 2018 20:12:26 GMT
It depends on which templates you are using. For instance, the recommenders
require queries to the EventStore to get user history, so shutting down
HBase will not work for them. Some templates (the Universal Recommender, for
instance) need Spark running at scale only during the training phase, so for
those it is much more cost-effective to stop Spark when it is not in use.

Every template uses the PIO framework in different ways. Dropping the DB is
not likely to work, especially if you are using it to store engine metadata.

We’d need to know which templates you are using to advise on cost savings.

From: Miller, Clifford <>
Reply: <>
Date: April 10, 2018 at 11:22:04 AM
To: <>
Subject:  Data import, HBase requirements, and cost savings ?

I'm exploring cost-saving options for a customer who wants to use
PredictionIO.  We plan to run multiple engines/templates.  We plan to run
everything in AWS and hope not to have all data loaded for all templates
at once.  The hope is to:

   1. Start up the HBase cluster.
   2. Import the events.
   3. Train the model.
   4. Store the model in S3.
   5. Shut down the HBase cluster.
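
The five steps above can be sketched as a small Python driver. This is only
a rough outline under stated assumptions, not a tested deployment script:
`start_cluster`, `stop_cluster`, and `run` are hypothetical callables you
would supply yourself (for example, wrappers around your AWS tooling and
`subprocess.run`). The only PredictionIO commands used are `pio import` and
`pio train`. Step 4 is assumed to be handled by however model storage is
configured for the engine, so it does not appear as a separate command.
Whether HBase can really be stopped afterwards depends on the template,
since some templates query the Event Store when serving.

```python
from typing import Callable, List, Sequence


def build_training_commands(app_id: int, events_file: str) -> List[List[str]]:
    """Return the pio CLI invocations for steps 2 and 3 (import, then train)."""
    return [
        # Step 2: load events from a file into the app's event store.
        ["pio", "import", "--appid", str(app_id), "--input", events_file],
        # Step 3: train; expected to be run from the engine directory.
        # Step 4 (model in S3) is assumed to follow from the engine's
        # model storage configuration, not from an extra command here.
        ["pio", "train"],
    ]


def run_training_cycle(
    app_id: int,
    events_file: str,
    start_cluster: Callable[[], None],   # hypothetical: step 1
    stop_cluster: Callable[[], None],    # hypothetical: step 5
    run: Callable[[Sequence[str]], None],  # hypothetical command runner
) -> None:
    """Start the cluster, import and train, then always stop the cluster."""
    start_cluster()
    try:
        for cmd in build_training_commands(app_id, events_file):
            run(cmd)
    finally:
        # Stop the cluster even if import or training fails, so a failed
        # run does not leave paid-for instances up.
        stop_cluster()
```

A dry run can pass `run=lambda cmd: print(" ".join(cmd))` to see the command
sequence without touching a real cluster. For a real run, something like
`lambda cmd: subprocess.run(cmd, check=True, cwd=engine_dir)` would be a
reasonable starting point, with `engine_dir` being your engine's directory.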

We have some general questions.

   1. Is this approach even feasible?
   2. Does PredictionIO require the Event Store (HBase) to be up and
   running constantly, or can we turn it off when not training?  If it
   requires HBase constantly, can we do the training from a different HBase
   cluster and then have separate PIO Event/Engine servers to deploy the
   applications using the model generated by the larger HBase cluster?
   3. Can the events be stored in S3 and then imported (pio import) when
   needed for training?  Or will we have to copy them out of S3 to our PIO
   Event/Engine server?
   4. Have any import benchmarks been done?  Events per second or MB/GB per

Any assistance would be appreciated.

