predictionio-user mailing list archives

From Pat Ferrel <>
Subject Re: Info / resources for scaling PIO?
Date Tue, 24 Apr 2018 17:16:08 GMT
PIO is based on the architecture of Spark, which uses HDFS. HBase also uses
HDFS. Scaling these is quite well documented on the web. Scaling PIO is
the same as scaling all its services. It is unlikely you’ll need it, but
you can also have more than one PIO server behind a load balancer.

Don’t use local models; put them in HDFS. Don’t mess with NFS; it is not
the design point for PIO. Scaling Spark beyond one machine will require
HDFS anyway, so use it.
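
For illustration, a minimal sketch of pointing model storage at HDFS in
conf/pio-env.sh. The variable names follow the stock pio-env.sh.template as
far as I recall. The source label HDFS, the repository name pio_model, and
the /models path are assumptions, so check them against your PIO version.

```shell
# conf/pio-env.sh (fragment): keep trained models in HDFS, not on local disk.
# "HDFS" is only a label that ties the repository to the source defined below.
PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model
PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=HDFS

# Definition of the source labelled HDFS. The path is an assumed example.
PIO_STORAGE_SOURCES_HDFS_TYPE=hdfs
PIO_STORAGE_SOURCES_HDFS_PATH=/models
```

If every PIO server reads the same pio-env.sh settings, each one should
resolve models from the same HDFS location and not from its own local disk.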

I also advise against using ES for all storage. Four things hit the event
storage: incoming events (input); training, where all events are read out
at high speed; optionally model storage (depending on the engine); and
queries, which usually hit the event storage. This will quickly overload one
service, and ES is not built as an object retrieval DB. The only reason to
use ES for all storage is that it is convenient when doing development or
experimenting with engines. In production it would be risky to rely on ES
for all storage and you would still need to scale out Spark and therefore
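
As a sketch of one way to avoid putting everything on ES, the fragment below
keeps metadata in ES and sends events to HBase. This particular split is an
assumption based on the HBase + ES setup mentioned in the question, not a
prescription. The labels ELASTICSEARCH and HBASE and the repository names are
illustrative, and host, port, and home-directory settings for each source are
left out because they depend on your cluster and PIO version.

```shell
# conf/pio-env.sh (fragment): split storage across services (assumed layout).
# Metadata stays in Elasticsearch.
PIO_STORAGE_REPOSITORIES_METADATA_NAME=pio_meta
PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH

# Events go to HBase, which sits on top of HDFS.
PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=pio_event
PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=HBASE

# Source type declarations. Connection details are intentionally omitted.
PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch
PIO_STORAGE_SOURCES_HBASE_TYPE=hbase
```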

There is a little written about various scaling models here: the architecture and workflow
tab, and there are a couple of system install docs that cover scaling.

From: Adam Drew <> <>
Reply: <>
Date: April 24, 2018 at 7:37:35 AM
To: <>
Subject:  Info / resources for scaling PIO?

Hi all!

Is there any info on how to scale PIO to multiple nodes? I’ve gone through
a lot of the docs on the site and haven’t found anything. I’ve tested PIO
running with HBase and ES for metadata and events, and with just ES
for both (my preference thus far), and have my models on local storage. Would
scaling simply be a matter of deploying clustered ES, and then finding some
way to share my model storage, such as NFS or HDFS? The question then is
what (if anything) has to be done for the nodes to “know” about changes on
other nodes. For example, if the model gets trained on node A does node B
automatically know about that?

I hope that makes sense. I’m coming to PIO with no prior experience with the
underlying Apache bits (Spark, HBase / HDFS, etc.), so there are likely things
I’m not considering. Any help / docs / guidance is appreciated.


