systemml-dev mailing list archives

From Sourav Mazumder <sourav.mazumde...@gmail.com>
Subject Re: Using GLM-predict
Date Wed, 09 Dec 2015 21:15:04 GMT
Hi Niketan,

First, to answer your questions:

1. Yes, the dependent variable is nothing but the label.
2. The values of the dependent variable are not 1 to totalNumOfClasses. They
can be any double. For example, say a weather data set has fields like lat,
long, height (above sea level), precipitation, pressure, and temperature.
One can then create a model where temperature is the dependent variable and
the others are features (the hypothesis being that temperature is some
function of pressure, precipitation, height, latitude, and longitude).

I'm not sure how step 2 and step 3 in your mail relate. In step 3, does one
have to pass the 'ID' column (created in step 2) as varName when calling
registerInput(String varName, DataFrame df, containsID)?

However, the unique ID is typically a string. Can't it be used as is? The
approach above means one first has to convert the original unique ID to an
integer in an additional column, and later map that integer ID back to the
original one.
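
The round trip I mean would look roughly like the sketch below. It is plain
Scala with no SystemML or Spark calls; the station IDs and predicted values
are made up, and toRowIds / attachKeys are hypothetical helper names, not
part of any existing API. It also assumes the string IDs are unique.

```scala
object IdRoundTrip {
  // Assign a one-based integer ID to each string key, in the order given.
  // Assumes the keys are unique; duplicates would collapse in the Map.
  def toRowIds(keys: Seq[String]): Map[String, Int] =
    keys.zipWithIndex.map { case (key, i) => key -> (i + 1) }.toMap

  // Map predictions keyed by integer row ID back to the original string keys.
  // Keys without a prediction are dropped.
  def attachKeys(rowIds: Map[String, Int],
                 predictions: Map[Int, Double]): Map[String, Double] =
    rowIds.collect {
      case (key, id) if predictions.contains(id) => key -> predictions(id)
    }

  def main(args: Array[String]): Unit = {
    // Made-up string IDs standing in for the Unique ID column of DF3.
    val rowIds = toRowIds(Seq("st-a", "st-b", "st-c"))
    // Made-up values standing in for the (ID, prediction) rows of DF5.
    val predictions = Map(1 -> 20.5, 2 -> 18.0, 3 -> 25.25)
    attachKeys(rowIds, predictions).toSeq.sortBy(_._1).foreach(println)
  }
}
```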

I was basically hoping for some sort of API where one can pass the original
data frame and specify which of its columns to use as features and which
column to use as the label. Such an API would work well for both creating
the model and getting predictions.

Regards,
Sourav

On Wed, Dec 9, 2015 at 12:53 PM, Niketan Pansare <npansar@us.ibm.com> wrote:

> Hi Sourav,
>
> A couple of questions to make sure we are on the same page: does the
> "dependent variable (double)" represent the class labels? Are the values of
> the class labels from 1 to numClasses (i.e. one-based)?
>
> Here are a few comments regarding correlating IDs:
>
> To map an unordered collection (i.e. a DataFrame) to an ordered
> collection (a "Matrix"), we add a special column "ID" which represents the
> *one-based row index*. Please perform the following steps:
> 1. Accept recent changes from https://github.com/apache/incubator-systemml
> and use the generated jar.
>
> 2. Map the unique id in DF1 to int (*1 to number of rows*) and call that
> column 'ID'.
>
> 3. Use the following variant of registerInput for X (for both training and
> predicting) and for Y:
> registerInput(String varName, DataFrame df, *boolean* containsID)
>
> As a side note: instead of separate double columns, you can represent them
> using VectorUDT and use our converter "JavaPairRDD<MatrixIndexes,
> MatrixBlock> vectorDataFrameToBinaryBlock(JavaSparkContext sc, DataFrame
> inputDF, MatrixCharacteristics mcOut, *boolean* containsID, String
> vectorColumnName)"
>
> Thanks,
>
> Niketan Pansare
> IBM Almaden Research Center
> E-mail: npansar At us.ibm.com
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
>
> From: Sourav Mazumder <sourav.mazumder00@gmail.com>
> To: dev@systemml.incubator.apache.org
> Date: 12/09/2015 11:15 AM
> Subject: Re: Using GLM-predict
> ------------------------------
>
>
>
> Hi Niketan,
>
> The code you provided works fine. The use of getMatrixCharacteristics
> solves the basic execution problem.
>
> However, question #3 is probably not yet resolved. Let me explain the use
> case scenario I'm trying to build.
>
> 1. Say I have a data frame (DF1) with a Unique Id (string), a bunch of
> columns (say 4) which are to be used as features (double), and a column for
> the dependent variable (double).
> 2. When I created the model, I created a data frame (DF2) from DF1 using
> only the feature columns and passed that as X. The column with the
> dependent value was passed as Y.
> 3. For calling GLM-predict I'm using another data frame (DF3) of the same
> structure but with different Unique IDs (essentially different
> records/rows). From that data frame I first create another data frame
> (DF4) containing only the feature columns, and then send DF4 to
> GLM-predict.
> 4. The response I get from GLM-predict is the 'means'. Then I use the
> inline predict script, which returns another data frame (DF5) with ID and
> predicted values.
>
> The question is how do I correlate the ID I'm getting from DF5 with the
> Unique ID of the data frame DF3 ?
>
> Regards,
> Sourav
>
>
>
>
> On Wed, Dec 9, 2015 at 9:17 AM, Niketan Pansare <npansar@us.ibm.com>
> wrote:
>
> > Hi Sourav,
> >
> > 1. In the GLM-predict.dml I could see 'means' is the output variable. In
> > my understanding it is the same as the probability matrix you mentioned in
> > your mail (to be used to compute the prediction). Am I right?
> > Yes, that's correct.
> >
> > 2. From GLM.dml I get the 'betas' as output using
> > outputs.getBinaryBlockedRDD("beta_out"). I pass the same to
> > GLM-predict.dml as B.
> >
> > Can you try this ?
> > // Get output from GLM
> > val beta = outputs.getBinaryBlockedRDD("beta_out")
> > // Fetching the matrix characteristics this way means you don't have to
> > // worry about dimensions.
> > val betaMC = outputs.getMatrixCharacteristics("beta_out")
> > // -----------------------------------------
> > // Xin: DataFrame/RDD of values (or even a text/csv file) you want to
> > // predict
> > val Xin = ...
> > // -----------------------------------------
> > // Execute GLM-predict
> > ml.reset()
> > // Please read
> > // https://github.com/apache/incubator-systemml/blob/master/scripts/algorithms/GLM.dml
> > // dfam Int 1 Distribution family code: 1 = Power, 2 = Binomial
> > // Which family of distribution?
> > val cmdLineParamsPredict = Map("X" -> " ", "B" -> " ", "dfam" -> "...")
> > ml.registerInput("X", Xin)
> > ml.registerInput("B_full", beta, betaMC)
> > ml.registerOutput("means")
> > val outputsPredict =
> >   ml.execute("/home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml",
> >     cmdLineParamsPredict)
> > // Read the results from outputsPredict, the value returned by execute
> > val prob = outputsPredict.getBinaryBlockedRDD("means")
> > val probMC = outputsPredict.getMatrixCharacteristics("means")
> > // -----------------------------------------
> > // Get predicted label
> > ml.reset()
> > ml.registerInput("Prob", prob, probMC)
> > ml.registerOutput("Prediction")
> > val outputsLabels = ml.executeScript("Prob = read(\"temp1\"); "
> >   + "Prediction = rowIndexMax(Prob); "
> >   + "write(Prediction, \"tempOut\", \"csv\")")
> > val pred = outputsLabels.getDF(sqlContext,
> >   "Prediction").withColumnRenamed("C1", "prediction")
> > // -----------------------------------------
> >
> >
> > 3. Say I get back the prediction matrix as an output (from predictions =
> > rowIndexMax(means);). Can I then add that as a column to my original
> > data frame (the one from which I created the feature vector for the
> > original model)? My concern is whether adding it back will preserve the
> > right order, so that the key for the feature vector and the predicted
> > value stay together. If not, how can I achieve that?
> > In the above example, 'pred' is a DataFrame with a column 'ID' which
> > provides the row ID.
> >
> > Thanks,
> >
> > Niketan Pansare
> > IBM Almaden Research Center
> > E-mail: npansar At us.ibm.com
> > http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> >
> > From: Sourav Mazumder <sourav.mazumder00@gmail.com>
> > To: dev@systemml.incubator.apache.org, Niketan Pansare/Almaden/IBM@IBMUS
> > Date: 12/08/2015 10:53 PM
> > Subject: Re: Using GLM-predict
> > ------------------------------
> >
> >
> >
> > Hi Niketan,
> >
> > Thanks again for the detailed inputs.
> >
> > Some more follow up Qs -
> >
> > 1. In the GLM-predict.dml I could see 'means' is the output variable. In
> > my understanding it is the same as the probability matrix you mentioned in
> > your mail (to be used to compute the prediction). Am I right?
> >
> > 2. From GLM.dml I get the 'betas' as output using
> > outputs.getBinaryBlockedRDD("beta_out"). I pass the same to
> > GLM-predict.dml as B. To register B, the following statements are used:
> > val beta = outputs.getBinaryBlockedRDD("beta_out")
> > // I have four feature vectors, so I get 4 coefficients
> > ml.registerInput("B", beta, 1, 4)
> >
> > However, when I execute GLM-predict.dml I get following error.
> >
> > val outputs =
> > ml.execute("/home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml",
> > cmdLineParams)
> >
> > 15/12/09 05:32:47 WARN Expression: Metadata file:  .mtd not provided
> > 15/12/09 05:32:47 ERROR Expression: ERROR:
> > /home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml -- line 117,
> > column 8 -- Missing or incomplete dimension information in read
> > statement:  .mtd
> > com.ibm.bi.dml.parser.LanguageException: Invalid Parameters : ERROR:
> > /home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml -- line 117,
> > column 8 -- Missing or incomplete dimension information in read
> > statement:  .mtd
> >
> > In line 117 we have following statement : X = read (fileX);
> >
> > 3. Say I get back the prediction matrix as an output (from predictions =
> > rowIndexMax(means);). Can I then add that as a column to my original
> > data frame (the one from which I created the feature vector for the
> > original model)? My concern is whether adding it back will preserve the
> > right order, so that the key for the feature vector and the predicted
> > value stay together. If not, how can I achieve that?
> >
> > Regards,
> > Sourav
> >
> >
> >
> >
> >
> > On Tue, Dec 8, 2015 at 2:08 PM, Niketan Pansare <npansar@us.ibm.com>
> > wrote:
> >
> > > Hi Sourav,
> > >
> > > For some reason, I didn't get your email of "*Tue, 08 Dec 2015 12:56:38
> > > -0800*"
> > > <https://www.mail-archive.com/search?l=dev@systemml.incubator.apache.org&q=date:20151208>
> > > (which I noticed in the archive).
> > >
> > > >> Not sure how exactly I can modify the GLM-predict.dml to get some
> > > prediction to start with.
> > > There are two options here:
> > > 1. Modify GLM-predict.dml as suggested by Shirish (better approach with
> > > respect to the SystemML optimizer) or
> > >
> > > 2. Run a new script on the output of GLM-predict. Please see:
> > > https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/api/ml/LogisticRegressionModel.java#L163
> > > If you choose to go with option 2, you might also want to read the
> > > documentation of the following two built-in functions:
> > > a. rowIndexMax (see
> > > http://apache.github.io/incubator-systemml/dml-language-reference.html#matrix-andor-scalar-comparison-built-in-functions)
> > > b. ppred
> > >
> > > >> Can you give me some idea how from here I can calculate the
> > > predicted value of the label using some value of probability threshold ?
> > > A very simple way to predict the label given the probability matrix:
> > > Prediction = rowIndexMax(Prob) # predicts the label with highest probability
> > > This assumes one-based labels.
> > >
> > > Thanks,
> > >
> > > Niketan Pansare
> > > IBM Almaden Research Center
> > > E-mail: npansar At us.ibm.com
> > > http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> > >
> > > From: Shirish Tatikonda <shirish.tatikonda@gmail.com>
> > > To: dev@systemml.incubator.apache.org
> > > Date: 12/08/2015 12:49 PM
> > > Subject: Re: Using GLM-predict
> > > ------------------------------
> > >
> > >
> > >
> > > Hi Sourav,
> > >
> > > Yes, GLM-predict.dml gives out only the probabilities. You can put a
> > > threshold on the resulting probabilities to get the actual class
> > > labels; for example, prob > 0.5 is positive and prob <= 0.5 is negative.
> > >
> > > The exact value of the threshold typically depends on the data and the
> > > application. Different thresholds yield different classifiers with
> > > different performance (precision, recall, etc.). You can find the best
> > > threshold for a given data set by finding a value that gives the
> > > desired classifier performance (for example, a threshold that gives
> > > roughly equal precision and recall). Such an optimization is done
> > > during the training phase using a held-out test set.
> > >
> > > If you wish, you can also modify the DML script to perform this entire
> > > process.
> > >
> > > Shirish
> > >
> > >
> > > On Tue, Dec 8, 2015 at 12:23 PM, Sourav Mazumder <
> > > sourav.mazumder00@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > I have used GLM.dml to create a model using some sample data. It
> > > > returns to me the matrix of Beta, B.
> > > >
> > > > Now I want to use this matrix of Beta on a new set of data points and
> > > > generate predicted value of the dependent variable/observation.
> > > >
> > > > When I checked GLM-predict, I could see that one can pass feature
> > > > vector for the new data set and also the matrix of beta.
> > > >
> > > > But I could not see any way to get the predicted value of the
> > > > dependent variable/observation. The output parameter only supports
> > > > matrix of predicted means/probabilities.
> > > >
> > > > Is there a way one can get the predicted value of the dependent
> > > > variable/observation from GLM-predict ?
> > > >
> > > > Regards,
> > > > Sourav
> > > >
> > >
> > >
> > >
> >
> >
> >
>
>
>
