Hi Sourav,
A couple of questions to make sure we are on the same page: does the
"dependent variable (double)" represent the class labels? Are the values of
the class labels from 1 to numClasses (i.e., one-based)?
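If the labels turn out not to be one-based, one way to remap them before
training is sketched below. This is plain Scala over an in-memory Seq, only to
show the idea; the object and method names are hypothetical and not part of
SystemML, and on a DataFrame the same mapping would be applied to the label
column.

```scala
object LabelEncoding {
  // Map each distinct label value to a code in 1..numClasses (sorted order).
  def buildEncoder(labels: Seq[Double]): Map[Double, Int] =
    labels.distinct.sorted.zipWithIndex.map { case (label, i) => label -> (i + 1) }.toMap

  // Replace every label by its one-based code, kept as double.
  def encode(labels: Seq[Double], encoder: Map[Double, Int]): Seq[Double] =
    labels.map(label => encoder(label).toDouble)

  // Translate a one-based predicted code back to the original label value.
  def decode(code: Int, encoder: Map[Double, Int]): Double =
    encoder.map(_.swap).apply(code)
}
```

Keeping the encoder around lets you translate a predicted code back to the
original label value afterwards.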
Here are a few comments regarding correlating IDs:
To convert an unordered collection (i.e. a DataFrame) into an ordered
collection (a "Matrix"), we add a special column "ID" that holds the
one-based row index. Please perform the following steps:
1. Pull the recent changes from https://github.com/apache/incubator-systemml
and use the generated jar.
2. Map the unique ID in DF1 to an int (1 to the number of rows) and call that
column 'ID'.
3. Use the following variant of registerInput for X (for both training and
prediction) and Y:
registerInput(String varName, DataFrame df, boolean containsID)
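A minimal sketch of the ID bookkeeping behind step 2, written with plain Scala
collections instead of DataFrames so that it stays self-contained. The object
and method names are hypothetical (they are not SystemML APIs), and the unique
IDs are assumed to be distinct. With DataFrames, the same idea would be a join
on the 'ID' column.

```scala
object IdCorrelation {
  // Assign each unique string ID a one-based row index (the 'ID' column).
  def assignRowIds(uniqueIds: Seq[String]): Map[String, Int] =
    uniqueIds.zipWithIndex.map { case (uid, i) => uid -> (i + 1) }.toMap

  // Join (ID, prediction) rows back to the original string IDs.
  // Predictions whose ID is unknown are dropped.
  def attachPredictions(rowIds: Map[String, Int],
                        predictions: Seq[(Int, Double)]): Map[String, Double] = {
    val byRowId = rowIds.map(_.swap) // one-based ID -> unique string ID
    predictions.collect {
      case (id, pred) if byRowId.contains(id) => byRowId(id) -> pred
    }.toMap
  }
}
```

Because the lookup goes through the 'ID' value rather than the position in
the result, the order in which the predictions come back does not matter.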
As a side note: instead of separate double columns, you can represent the
features as a single VectorUDT column and use our converter:
JavaPairRDD<MatrixIndexes, MatrixBlock> vectorDataFrameToBinaryBlock(JavaSparkContext sc, DataFrame inputDF, MatrixCharacteristics mcOut, boolean containsID, String vectorColumnName)
Thanks,
Niketan Pansare
IBM Almaden Research Center
Email: npansar At us.ibm.com
http://researcher.watson.ibm.com/researcher/view.php?person=usnpansar
From: Sourav Mazumder <sourav.mazumder00@gmail.com>
To: dev@systemml.incubator.apache.org
Date: 12/09/2015 11:15 AM
Subject: Re: Using GLMpredict
Hi Niketan,
The code you provided works fine. The use of getMatrixCharacteristics
solves the basic execution problem.
However, question #3 is probably not yet resolved. Let me explain the use
case scenario I'm trying to build.
1. Say I have a data frame (DF1) with a Unique Id (string), a bunch of
columns (say 4) which are to be used as features (double), and a column for
the dependent variable (double).
2. When I created the model, I created a data frame (DF2) from DF1 using
only the feature columns and passed that as X. The column with the dependent
variable was passed as Y.
3. For calling GLMpredict I'm using another data frame (DF3) of the same
structure but with different Unique IDs (essentially different
records/rows). From that data frame I first create another data frame
(DF4) containing only the feature columns, and then send DF4 to GLMpredict.
4. The response I get from GLMpredict is the 'means'. I then use the
inline predict script, which returns another data frame (DF5) with ID and
predicted values.
The question is: how do I correlate the ID I get from DF5 with the
Unique ID of data frame DF3?
Regards,
Sourav
On Wed, Dec 9, 2015 at 9:17 AM, Niketan Pansare <npansar@us.ibm.com> wrote:
> Hi Sourav,
>
> 1. In the GLMpredict.dml I could see 'means' is the output variable. In my
> understanding it is same as the probability matrix u have mentioned in your
> mail (to be used to compute the prediction). Am I right ?
> Yes, that's correct.
>
> 2. From GLM.dml I get the 'betas' as output using
> outputs.getBinaryBlockedRDD("beta_out"). The same I pass to GLMpredict.dml
> as B.
>
> Can you try this ?
> // Get output from GLM
> val beta = outputs.getBinaryBlockedRDD("beta_out")
> // This way you don't have to worry about dimensions.
> val betaMC = outputs.getMatrixCharacteristics("beta_out")
> // 
> val Xin = DataFrame/RDD of values (or even text/csv file) you want to
> predict
> // 
> // Execute GLMpredict
> ml.reset()
> // Please read
> // https://github.com/apache/incubator-systemml/blob/master/scripts/algorithms/GLM.dml
> // dfam Int 1 Distribution family code: 1 = Power, 2 = Binomial
> val cmdLineParamsPredict = Map("X" -> " ", "B" -> " ", "dfam" -> "...") // family of distribution ?
> ml.registerInput("X", Xin)
> ml.registerInput("B_full", beta, betaMC)
> ml.registerOutput("means")
> val outputsPredict =
> ml.execute("/home/systemml0.9.0SNAPSHOT/algorithms/GLMpredict.dml",
> cmdLineParamsPredict)
> val prob = outputsPredict.getBinaryBlockedRDD("means")
> val probMC = outputsPredict.getMatrixCharacteristics("means")
> // 
> // Get predicted label
> ml.reset()
> ml.registerInput("Prob",prob, probMC)
> ml.registerOutput("Prediction")
> val outputsLabels = ml.executeScript("Prob = read(\"temp1\"); "
> + "Prediction = rowIndexMax(Prob); "
> + "write(Prediction, \"tempOut\", \"csv\")")
> val pred = outputsLabels.getDF(sqlContext,
> "Prediction").withColumnRenamed("C1", "prediction")
> // 
>
>
> 3. Say I get back prediction matrix as an output (from predictions =
> rowIndexMax(means);). Now can I add that as a column to my original
> data frame (the one from which I created the feature vector for the
> original model) ? My concern is whether adding back will ensure the right
> order so that the key for the feature vector and the predicted value remain
> same ? If not how to achieve the same ?
> In the above example, 'pred' is a DataFrame with a column 'ID' that provides
> the row ID.
>
> Thanks,
>
> Niketan Pansare
> IBM Almaden Research Center
> Email: npansar At us.ibm.com
> http://researcher.watson.ibm.com/researcher/view.php?person=usnpansar
>
>
> From: Sourav Mazumder <sourav.mazumder00@gmail.com>
> To: dev@systemml.incubator.apache.org, Niketan Pansare/Almaden/IBM@IBMUS
> Date: 12/08/2015 10:53 PM
> Subject: Re: Using GLMpredict
> 
>
>
>
> Hi Niketan,
>
> Thanks again for the detailed inputs.
>
> Some more follow up Qs 
>
> 1. In the GLMpredict.dml I could see 'means' is the output variable. In my
> understanding it is same as the probability matrix u have mentioned in your
> mail (to be used to compute the prediction). Am I right ?
>
> 2. From GLM.dml I get the 'betas' as output using
> outputs.getBinaryBlockedRDD("beta_out"). The same I pass to GLMpredict.dml
> as B. For registering B following statements are used
> val beta = outputs.getBinaryBlockedRDD("beta_out")
> // I have four feature vectors so I get 4 coefficients
> ml.registerInput("B", beta, 1, 4)
>
> However, when I execute GLMpredict.dml I get following error.
>
> val outputs =
> ml.execute("/home/systemml0.9.0SNAPSHOT/algorithms/GLMpredict.dml",
> cmdLineParams)
>
> 15/12/09 05:32:47 WARN Expression: Metadata file: .mtd not provided
> 15/12/09 05:32:47 ERROR Expression: ERROR: /home/systemml0.9.0SNAPSHOT/algorithms/GLMpredict.dml  line 117, column 8  Missing or incomplete dimension information in read statement: .mtd
> com.ibm.bi.dml.parser.LanguageException: Invalid Parameters : ERROR: /home/systemml0.9.0SNAPSHOT/algorithms/GLMpredict.dml  line 117, column 8  Missing or incomplete dimension information in read statement: .mtd
>
> In line 117 we have following statement : X = read (fileX);
>
> 3. Say I get back prediction matrix as an output (from predictions =
> rowIndexMax(means);). Now can I add that as a column to my original
> data frame (the one from which I created the feature vector for the
> original model) ? My concern is whether adding back will ensure the right
> order so that the key for the feature vector and the predicted value remain
> same ? If not how to achieve the same ?
>
> Regards,
> Sourav
>
>
>
>
>
> On Tue, Dec 8, 2015 at 2:08 PM, Niketan Pansare <npansar@us.ibm.com>
> wrote:
>
> > Hi Sourav,
> >
> > For some reason, I didn't get your email of "Tue, 08 Dec 2015 12:56:38 0800"
> > <https://www.mailarchive.com/search?l=dev@systemml.incubator.apache.org&q=date:20151208>
> > (which I noticed in the archive).
> >
> > >> Not sure how exactly I can modify the GLMpredict.dml to get some
> > prediction to start with.
> > There are two options here:
> > 1. Modify GLMpredict.dml as suggested by Shirish (better approach with
> > respect to the SystemML optimizer) or
> >
> > 2. Run a new script on the output of GLMpredict. Please see:
> > https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/api/ml/LogisticRegressionModel.java#L163
> > If you choose to go with option 2, you might also want to read the
> > documentation of the following two built-in functions:
> > a. rowIndexMax (see
> > http://apache.github.io/incubatorsystemml/dmllanguagereference.html#matrixandorscalarcomparisonbuiltinfunctions
> > )
> > b. ppred
> >
> > >> Can you give me some idea how from here I can calculate the predicted
> > value of the label using some value of probability threshold ?
> > A very simple way to predict the label given the probability matrix:
> > Prediction = rowIndexMax(Prob) # predicts the label with the highest
> > probability. This assumes one-based labels.
> >
> > Thanks,
> >
> > Niketan Pansare
> > IBM Almaden Research Center
> > Email: npansar At us.ibm.com
> > http://researcher.watson.ibm.com/researcher/view.php?person=usnpansar
> >
> >
> > From: Shirish Tatikonda <shirish.tatikonda@gmail.com>
> > To: dev@systemml.incubator.apache.org
> > Date: 12/08/2015 12:49 PM
> > Subject: Re: Using GLMpredict
> > 
> >
> >
> >
> > Hi Sourav,
> >
> > Yes, GLMpredict.dml gives out only the probabilities. You can put a
> > threshold on the resulting probabilities to get the actual class labels:
> > for example, prob > 0.5 as positive and prob <= 0.5 as negative.
> >
> > The exact value of the threshold typically depends on the data and the
> > application. Different thresholds yield different classifiers with
> > different performance (precision, recall, etc.). You can find the best
> > threshold for the given data set by finding a value that gives the desired
> > classifier performance (for example, a threshold that gives roughly equal
> > precision and recall). Such an optimization is done during the training
> > phase using a held-out test set.
> >
> > If you wish, you can also modify the DML script to perform this entire
> > process.
> >
> > Shirish
> >
> >
> > On Tue, Dec 8, 2015 at 12:23 PM, Sourav Mazumder <
> > sourav.mazumder00@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I have used GLM.dml to create a model using some sample data. It returns
> > > to me the matrix of Beta, B.
> > >
> > > Now I want to use this matrix of Beta on a new set of data points and
> > > generate predicted value of the dependent variable/observation.
> > >
> > > When I checked GLMpredict, I could see that one can pass feature vector
> > > for the new data set and also the matrix of beta.
> > >
> > > But I could not see any way to get the predicted value of the dependent
> > > variable/observation. The output parameter only supports matrix of
> > > predicted means/probabilities.
> > >
> > > Is there a way one can get the predicted value of the dependent
> > > variable/observation from GLMpredict ?
> > >
> > > Regards,
> > > Sourav
> > >
> >
> >
> >
>
>
>
