predictionio-user mailing list archives

From Pat Ferrel <>
Subject Re: How to transform variables?
Date Fri, 20 Jan 2017 23:37:11 GMT
I see PIO as a production big-data pipeline. It sounds like what you need is an interactive
math framework where you can change the function and do some cross-validation in nearly
real time. This seems to imply R, Python, or Scala + Mahout Samsara + Zeppelin. Of these,
Mahout is the only interactive tool that runs on a Spark cluster backend and so can crunch
a lot of data in the interactive Scala shell. If you don’t need big data, the others might
be more familiar. There are a lot of regression algorithms prepackaged in those and some
in PIO templates.

Then, once you have the algorithm designed, put it in PIO for everyday production
learning/prediction, with its parameters in engine.json so you won’t have to change code
to tune it.
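
As a rough illustration of keeping the tunable choice in configuration rather than code, here is a minimal Python sketch. It is not PIO's API: the JSON only mimics the general shape of an engine.json "algorithms" entry, and the "transform" key, the "regression" name, and the helper names are assumptions made up for this example.

```python
import json
import math

# Hypothetical fragment in the shape of an engine.json "algorithms" entry.
# The "transform" key is an illustrative assumption, not a PIO setting.
ENGINE_JSON = """
{
  "algorithms": [
    {"name": "regression", "params": {"transform": "log"}}
  ]
}
"""

# Candidate feature transforms, looked up by the name found in the config.
TRANSFORMS = {
    "identity": lambda x: x,
    "log": math.log,
    "square": lambda x: x * x,
}

def load_transform(config_text):
    """Return the transform function named in the first algorithm's params."""
    params = json.loads(config_text)["algorithms"][0]["params"]
    return TRANSFORMS[params["transform"]]

transform = load_transform(ENGINE_JSON)
```

Switching from the log to, say, the square then means editing one string in the config and retraining, with no code change.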

On Jan 20, 2017, at 10:17 AM, Daniel Gabrieli <> wrote:

Thank you. That is helpful.  More specifically, I am trying to implement a regression of a
form like this:

write_score = B0 + B1*gender + B2*log(math) + B3*log(read)

Where a student's predicted writing score is a function of gender, the log of a math score,
and the log of a reading score.

But in fact, what I am trying to understand is how to do feature engineering inside of PIO.
I want to try various manipulations of the data to figure out what the best features are
for a given model (log is a common example).  I might want to try, for example, another regression:

write_score = B0 + B2*(math - read)^2

Where the score on writing is a function of the squared difference between the math and reading
scores.
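
To make the two candidate feature sets concrete, here is a minimal Python sketch (plain Python, not PIO code). The function names and the 0/1 coding of gender are assumptions for illustration; the first feature map follows the prose description, which includes gender.

```python
import math

def log_features(gender, math_score, read_score):
    """Features for the first model: intercept, gender, log(math), log(read)."""
    return [1.0, float(gender), math.log(math_score), math.log(read_score)]

def squared_diff_features(math_score, read_score):
    """Features for the second model: intercept, (math - read)^2."""
    return [1.0, (math_score - read_score) ** 2]

def score(coeffs, features):
    """Linear predictor: sum of coefficient * feature, in matching order."""
    return sum(b * f for b, f in zip(coeffs, features))
```

Each feature map turns a raw record into the row a linear regression would see, so trying a new idea only means writing another small function and refitting.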

I'd prefer to manipulate variables within the PIO Engine because the servers that send the event
data to PIO are "just dumb pipes" and I'd like to keep the "data science" logic outside of
those pipes and inside of PIO as much as possible.

On Fri, Jan 20, 2017 at 12:45 PM Pat Ferrel <> wrote:
It would help to know what you are trying to implement.

The datasource and preparator are used only during the input part of train. They pass data
to the train method of your algorithm when you run `pio train`. The predict method does not
use them at all. It may get data from the EventStore, but not through those other classes.
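
Because the preparation step never runs at query time, one way to keep the two paths consistent is to put the transform in a single function and call it from both train and predict. The sketch below shows that pattern in plain Python, assuming a one-variable least-squares fit; it does not use PIO's classes, and the function names and data are hypothetical.

```python
import math

def transform(x):
    """Feature transform shared by training and prediction (natural log)."""
    return math.log(x)

def train(xs, ys):
    """Fit y = b0 + b1 * transform(x) by ordinary least squares."""
    zs = [transform(x) for x in xs]
    n = len(zs)
    mean_z = sum(zs) / n
    mean_y = sum(ys) / n
    sxx = sum((z - mean_z) ** 2 for z in zs)
    sxy = sum((z - mean_z) * (y - mean_y) for z, y in zip(zs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_z
    return b0, b1

def predict(model, x):
    """Apply the same transform to the raw query value before scoring."""
    b0, b1 = model
    return b0 + b1 * transform(x)

# Hypothetical data generated from y = 2 + 3 * log(x).
xs = [1.0, 2.0, 4.0, 8.0]
ys = [2.0 + 3.0 * math.log(x) for x in xs]
model = train(xs, ys)
```

The query carries the raw value, and predict applies the log itself, so the senders do not need to know about the transform.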

If you need data to always be the log of some number, you may want to take the log before it
is sent to the EventServer so it will always be a log, even when you get it from the Query or
out of the EventServer.

On Jan 20, 2017, at 5:13 AM, Daniel Gabrieli <> wrote:


I am new to PIO.

I have a variable called X that I would like to take the log of during training and then during
prediction as well.  Where is the appropriate place to put the log function?

My guess is to override the "prepare" method; while I think the prepare method is called just
before training, I am not clear whether it is also called before prediction.

Do I call the log transformation again somewhere else so that it occurs during prediction?
Possibly in the predict method?

Thank you,

