systemml-dev mailing list archives

From Sourav Mazumder <sourav.mazumde...@gmail.com>
Subject Re: Using GLM-predict
Date Thu, 10 Dec 2015 16:00:18 GMT
Hi Niketan,

Thanks for the explanation.

While trying out the new build from GitHub I'm facing an issue.

I downloaded the zip from GitHub and rebuilt the package using 'mvn clean
package'.

The first thing I noticed is that the target folder contains no .tar file
for the distribution (like system-ml-0.9.0-SNAPSHOT-distrib.tar.gz). This
was created previously when I downloaded the previous version from
GitHub. So I tried system-ml-0.9.0-SNAPSHOT.jar instead, but with that I
started getting problems with the package name. I could finally run things
after changing the package structure to org.apache.sysml. Please update the
documentation accordingly.

However, when I tried running GLM-predict after adding a new column as ID,
GLM-predict started failing.

Here is the code I'm executing -

val beta = outputs.getBinaryBlockedRDD("beta_out")
val betaMC = outputs.getMatrixCharacteristics("beta_out")

val Xin = sqlContext.sql("select Res_Area, Bldg_Area, Lot_Area, Bldg_Age from modeldf")

val predDfIn = RDDConverterUtils.addIDToDataFrame(Xin, sqlContext, "ID")

val cmdLineParamsPredict = Map("X" -> " ", "B" -> " ")
ml.registerInput("X", predDfIn)
ml.registerInput("B_full", beta, betaMC)
ml.registerOutput("means")

val outputsPredict =
ml.execute("/home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml",
cmdLineParamsPredict)

The error is -

org.apache.sysml.runtime.DMLRuntimeException:
org.apache.sysml.runtime.DMLRuntimeException: ERROR: Runtime error in
program block generated from statement block between lines 122 and 123 --
Error evaluating instruction:
CP°rangeReIndex°B_full·MATRIX·DOUBLE°1·SCALAR·INT·true°5·SCALAR·INT·true°1·SCALAR·INT·true°1·SCALAR·INT·true°_mVar10563·MATRIX·DOUBLE
at
org.apache.sysml.runtime.controlprogram.Program.execute(Program.java:153)
at
org.apache.sysml.api.MLContext.executeUsingSimplifiedCompilationChain(MLContext.java:1337)
at
org.apache.sysml.api.MLContext.compileAndExecuteScript(MLContext.java:1203)
at
org.apache.sysml.api.MLContext.compileAndExecuteScript(MLContext.java:1149)
at org.apache.sysml.api.MLContext.execute(MLContext.java:631) at
org.apache.sysml.api.MLContext.execute(MLContext.java:666) at
org.apache.sysml.api.MLContext.execute(MLContext.java:679) at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:45)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:50)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:52) at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:54) at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:56) at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:58) at
$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:60) at
$iwC$$iwC$$iwC$$iwC.<init>(<console>:62) at
$iwC$$iwC$$iwC.<init>(<console>:64) at $iwC$$iwC.<init>(<console>:66)
at
$iwC.<init>(<console>:68) at <init>(<console>:70) at .<init>(<console>:74)
at .<clinit>(<console>) at .<init>(<console>:7) at .<clinit>(<console>)
at
$print(<console>)
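
One way to read the failing instruction, offered as a sketch rather than a confirmed diagnosis: assuming the operands after the opcode are the input matrix followed by row lower, row upper, column lower and column upper bounds (an assumption inferred from the message itself, not from SystemML documentation), the instruction asks for B_full[1:5, 1:1], i.e. five rows of beta. If beta_out has fewer than five rows, for example because the added ID column is being counted as a fifth feature, such an index could be out of range. The helper names `IndexRange` and `decode` below are hypothetical.

```scala
// Sketch only: the field layout is inferred from the error message above.
object RangeReIndexSketch {
  // The failing instruction, copied verbatim from the error.
  val instruction: String =
    "CP°rangeReIndex°B_full·MATRIX·DOUBLE°1·SCALAR·INT·true°5·SCALAR·INT·true" +
      "°1·SCALAR·INT·true°1·SCALAR·INT·true°_mVar10563·MATRIX·DOUBLE"

  // Hypothetical container for the decoded index range.
  final case class IndexRange(matrix: String, rowLo: Int, rowHi: Int, colLo: Int, colHi: Int)

  def decode(instr: String): IndexRange = {
    // Operands are separated by '°'; the fields of each operand by '·'.
    // Only the first field (name or literal value) of each operand is kept.
    val operands = instr.split('°').map(_.split('·')(0))
    // operands(0) = "CP", operands(1) = "rangeReIndex", operands(2) = input matrix,
    // then (assumed) row lower, row upper, column lower, column upper.
    IndexRange(operands(2), operands(3).toInt, operands(4).toInt,
      operands(5).toInt, operands(6).toInt)
  }

  def main(args: Array[String]): Unit = {
    val r = decode(instruction)
    // Prints: B_full[1:5, 1:1]
    println(s"${r.matrix}[${r.rowLo}:${r.rowHi}, ${r.colLo}:${r.colHi}]")
  }
}
```

Comparing the decoded row range with the dimensions reported by betaMC would show whether the request exceeds the size of beta.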

Regards,
Sourav

On Wed, Dec 9, 2015 at 9:56 PM, Niketan Pansare <npansar@us.ibm.com> wrote:

> Hi Sourav,
>
> There are two possible options here:
> 1. If "unique_id" is a one-based integer column: in this case, please
> rename the "unique_id" column to ID and use the registerInput("X", DF1, true)
> method.
>
> 2. If "unique_id" is anything else (for example, a String), then there is
> no trivial way for SystemML to correlate a string-based unique id to a row
> index (which is required to interpret a DataFrame as a matrix). This
> means you have to explicitly add the column ID to DF1:
> val dataset = RDDConverterUtilsExt.addIDToDataFrame(DF1, sqlContext,
> "ID")
>
> When you get DF5 from GLM-predict.dml, you can use the following two lines of
> code, which guarantee the correct mapping:
> // Note: DF5 already has a column ID, which specifies the row index.
> val DF5 = outNew.getDF(sqlContext, "outPred").withColumnRenamed("C1",
> "prediction")
> val output = dataset.join(DF5, dataset.col("ID").equalTo(DF5.col("ID")))
>
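
A minimal sketch of the join-by-ID idea, using plain Scala collections in place of Spark DataFrames so that it stays self-contained (that substitution is an assumption of this sketch). `InputRow`, `Prediction`, `addId` and `joinById` are hypothetical names, not SystemML or Spark APIs. Each row gets a one-based index before prediction, and the results are matched back on that index, so the order in which predictions come back does not matter.

```scala
object IdJoinSketch {
  // Hypothetical row types, used only for this illustration.
  final case class InputRow(uniqueId: String, feature: Double)
  final case class Prediction(id: Long, mean: Double)

  // Attach a one-based row index, analogous in spirit to adding an "ID" column.
  def addId(rows: Seq[InputRow]): Seq[(Long, InputRow)] =
    rows.zipWithIndex.map { case (row, i) => (i + 1L, row) }

  // Join predictions back to inputs by ID, independent of the order of either side.
  def joinById(withId: Seq[(Long, InputRow)], preds: Seq[Prediction]): Map[String, Double] = {
    val meanById = preds.map(p => p.id -> p.mean).toMap
    withId.collect {
      case (id, row) if meanById.contains(id) => row.uniqueId -> meanById(id)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val input = Seq(InputRow("a-17", 1.0), InputRow("b-42", 2.0), InputRow("c-08", 3.0))
    // Predictions arrive in a different order than the input rows.
    val preds = Seq(Prediction(3L, 30.0), Prediction(1L, 10.0), Prediction(2L, 20.0))
    println(joinById(addId(input), preds))
  }
}
```

The string identifier never has to be converted to a number; it only rides along next to the generated index.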
> Note: once a DataFrame is passed to SystemML via registerInput, SystemML
> first converts the DataFrame into binary block format (i.e.
> JavaPairRDD<MatrixIndexes, MatrixBlock>) and executes GLM-predict.dml using
> the binary block. After execution, the output is present in MLOutput (
> https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/api/MLOutput.java#L89)
> in binary block format. If the user chooses to, he/she may call getDF(...), which
> does the binary block to DataFrame conversion.
>
> For DataFrame to binary block conversion, see
> https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/runtime/instructions/spark/utils/RDDConverterUtilsExt.java#L277
> ... ordering specified by zipWithIndex (which is also used by
> RDDConverterUtilsExt.addIDToDataFrame)
> For binary block to DataFrame conversion, see
> https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/runtime/instructions/spark/utils/RDDConverterUtilsExt.java#L364
> ... ordering specified by internal binary block format and hence we append
> an extra column ID to specify this ordering.
>
> Thanks,
>
> Niketan Pansare
> IBM Almaden Research Center
> E-mail: npansar At us.ibm.com
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
>
>
> From: Sourav Mazumder <sourav.mazumder00@gmail.com>
> To: dev@systemml.incubator.apache.org
> Date: 12/09/2015 06:20 PM
> Subject: Re: Using GLM-predict
> ------------------------------
>
>
>
> Hi Niketan,
>
> Thanks again for such a detailed explanation. I see your last point and am in
> agreement with it. Also I got your point on the use of "means" for
> Gaussian vs other distributions.
>
> However, I'm still not convinced about the approach you mentioned for
> correlating the unique id. I've already tried code similar to what you
> sent, where I used the VectorAssembler utility of Spark MLlib.
>
> Let me try to explain the problem in more detail -
>
> 1. Say my original data frame DF1 is distributed across 3 slave nodes in a
> Spark cluster. Each has, say, 20 rows. Total 60 rows. DF1 also has a
> unique identifier column, say unique_id.
> 2. Now I used your code to create the feature vector from DF1 and pass it
> to GLM-predict. And GLM-predict in turn returns me another data frame (say
> DF5) of "means" (in this case, say, prediction). However, the rows of DF5 may
> be distributed across 4 slave nodes, each having, say, 15 rows. Total 60 rows.
> 3. Now if I just add this new data frame (DF5) as two additional columns to
> DF1, where is the guarantee that for a specific unique_id of DF1 I'm getting
> the right mean/predicted value corresponding to that unique_id ?
>
> Regards,
> Sourav
>
>
>
> On Wed, Dec 9, 2015 at 4:14 PM, Niketan Pansare <npansar@us.ibm.com>
> wrote:
>
> > Hi Sourav,
> >
> > Please see below comments:
> >
> > >> I was basically hoping for some sort of API where one can pass the
> > >> original data frame and from that dataframe can specify the columns to
> > >> be used as feature and the column to be used for label. This model can
> > >> work well for both creating the model and getting the prediction.
> > Please use the most recent jar from git. To extract X and Y from your
> > dataframe without IDs, use the following code:
> > import org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtilsExt
> > import org.apache.sysml.runtime.matrix.MatrixCharacteristics
> > val features = Array("lat", "height", "precipitation", "pressure")
> > // SystemML will set them for you if the dimensions are unknown
> > val Xmc = new MatrixCharacteristics()
> > val Ymc = new MatrixCharacteristics()
> > val X = RDDConverterUtilsExt.dataFrameToBinaryBlock(sc, df, Xmc, features)
> > val Y = RDDConverterUtilsExt.dataFrameToBinaryBlock(sc, df, Ymc,
> > Array("temperature"))
> >
> > If you want to add a specific ordering to your DataFrame rows (let's say
> > for prediction ... in most cases it is not required), use the following
> > method:
> > import org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtilsExt
> > val dfWithID = RDDConverterUtilsExt.addIDToDataFrame(df, sqlContext, "ID")
> >
> > >> 1. Yes dependent variables are nothing but labels
> > >> 2. The values of the dependent variable are not 1 to totalNumOfClasses.
> > >> The values can be any double number. For example say in a weather data
> > >> set you have fields like lat, long, height (from sea level),
> > >> precipitation, pressure, temperature. Now one way you can create a model
> > >> where Temperature is the dependent variable and other are features (the
> > >> hypothesis is Temperature is some function of pressure, precipitation,
> > >> height, latitude and longitude.
> > Sorry, in this case, please ignore my earlier suggestion of "Prediction =
> > rowIndexMax(Prob)" as it applies only to classification.
> > In your case, the returned values are "means" of the distribution family
> > which was used (see
> > http://apache.github.io/incubator-systemml/algorithms-regression.html#generalized-linear-models
> > ).
> > If the Gaussian distribution was used (dfam=1, vpow=0.0), and if the problem
> > was linear and if you expected a pointy-hat distribution (i.e. positive
> > kurtosis), then you can simply return the mean as the predicted label. This
> > is because in the case of the Gaussian distribution, the mean is also the
> > mode. In other cases, it might not necessarily be true.
> >
> > You may ask why we are making it so complicated and why not just return
> > the predicted labels instead of probabilities ?
> > Well, the problem of labelling is not as simple as it appears and it
> > highly depends on the problem setting. Let's consider the problem of
> > multi-class classification and my earlier suggestion "Prediction =
> > rowIndexMax(Prob)". Also, let the labels be as follows: {cancer, sore
> > throat, birth defect, fever, normal}. For a given test example, let's
> > say GLM-predict.dml outputs the following probabilities: {cancer: 0.2, sore
> > throat: 0.15, birth defect: 0.15, fever: 0.2, normal: 0.3}. Then according
> > to "Prediction = rowIndexMax(Prob)", we should output the label "normal"
> > and send the patient home ... right ? No. In this case, a 20% probability
> > of cancer is just way too high for a doctor to send the patient home. In
> > this setting, the doctor might then say to the data scientist: based on
> > the prevalence of cancer in the general public, and on that domain
> > knowledge, I suggest that a probability over "threshold" should always be
> > flagged as cancer. Else output the label with the highest probability. Using
> > this suggestion, the data scientist modifies the DML as follows:
> > zeroOneMat = ppred(prob[, cancerColID], threshold, ">")
> > prediction = zeroOneMat*cancerColID + (1-zeroOneMat)*rowIndexMax(prob)
> >
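
The same decision rule for a single row, written as a plain Scala sketch rather than DML. The object and function names are hypothetical, and the two threshold values in `main` are illustrative assumptions; only the probability vector comes from the example above. Column indices are one-based to match the description of rowIndexMax.

```scala
object ThresholdRuleSketch {
  // One-based index of the largest entry in a row of probabilities.
  def rowIndexMax(row: Array[Double]): Int = row.indexOf(row.max) + 1

  // Flag the special class whenever its probability exceeds the threshold;
  // otherwise fall back to the most probable class. Indices are one-based.
  def predict(row: Array[Double], flaggedCol: Int, threshold: Double): Int =
    if (row(flaggedCol - 1) > threshold) flaggedCol else rowIndexMax(row)

  def main(args: Array[String]): Unit = {
    // Order: cancer, sore throat, birth defect, fever, normal.
    val prob = Array(0.2, 0.15, 0.15, 0.2, 0.3)
    println(predict(prob, flaggedCol = 1, threshold = 0.1))  // prints 1 (cancer flagged)
    println(predict(prob, flaggedCol = 1, threshold = 0.25)) // prints 5 (normal)
  }
}
```

With a low threshold the cancer column wins even though "normal" has the highest probability; with a threshold above 0.2 the rule falls back to the ordinary maximum.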
> > This also shows the usefulness of "Declarative Machine Learning" :)
> >
> > Thanks,
> >
> > Niketan Pansare
> > IBM Almaden Research Center
> > E-mail: npansar At us.ibm.com
> > http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> >
> >
> > From: Sourav Mazumder <sourav.mazumder00@gmail.com>
> > To: dev@systemml.incubator.apache.org
> > Date: 12/09/2015 01:15 PM
> > Subject: Re: Using GLM-predict
> > ------------------------------
> >
> >
> >
> > Hi Niketan,
> >
> > Firstly to answer your Qs -
> >
> > 1. Yes dependent variables are nothing but labels
> > 2. The values of the dependent variable are not 1 to totalNumOfClasses.
> > The values can be any double number. For example, say in a weather data set
> > you have fields like lat, long, height (from sea level), precipitation,
> > pressure, temperature. Now one way you can create a model is where
> > Temperature is the dependent variable and the others are features (the
> > hypothesis is that Temperature is some function of pressure, precipitation,
> > height, latitude and longitude).
> >
> > Not sure about the correlation between step 2 and step 3 in your mail. In
> > step 3 does one have to pass 'ID' column (created in step 2) to varName
> > while calling registerInput(String varName, DataFrame df, containsID) ?
> >
> > However, the unique Id in a typical case can be a string. Can't that be
> > used as is instead ? This means one has to first convert the original
> > unique id to integer to create an additional unique id column and then
> > again later on that integer unique id has to be mapped back.
> >
> > I was basically hoping for some sort of API where one can pass the
> > original data frame and from that dataframe can specify the columns to be
> > used as feature and the column to be used for label. This model can work
> > well for both creating the model and getting the prediction.
> >
> > Regards,
> > Sourav
> >
> > On Wed, Dec 9, 2015 at 12:53 PM, Niketan Pansare <npansar@us.ibm.com>
> > wrote:
> >
> > > Hi Sourav,
> > >
> > > Couple of questions to make sure we are on the same page: does the
> > > "dependent variable (double)" represent the class labels ? Are the values
> > > of the class labels from 1 to numClasses (i.e. one-based) ?
> > >
> > > Here are a few comments regarding correlating IDs:
> > >
> > > To convert an unordered collection (i.e. DataFrame) into an ordered
> > > collection ("Matrix"), we add a special column "ID" which represents the
> > > one-based row index. Please perform the following steps:
> > > 1. Accept recent changes from
> > > https://github.com/apache/incubator-systemml
> > > and use the generated jar.
> > >
> > > 2. Map the unique id in DF1 to int (1 to number of rows) and call that
> > > column 'ID'.
> > >
> > > 3. Use the variant of registerInput for both X (both for training and
> > > predicting) and Y:
> > > registerInput(String varName, DataFrame df, boolean containsID)
> > >
> > > As a side note: instead of separate double columns, you can represent
> > > them using VectorUDT and use our converter "JavaPairRDD<MatrixIndexes,
> > > MatrixBlock> vectorDataFrameToBinaryBlock(JavaSparkContext sc, DataFrame
> > > inputDF, MatrixCharacteristics mcOut, boolean containsID, String
> > > vectorColumnName)"
> > >
> > > Thanks,
> > >
> > > Niketan Pansare
> > > IBM Almaden Research Center
> > > E-mail: npansar At us.ibm.com
> > > http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> > >
> > >
> > > From: Sourav Mazumder <sourav.mazumder00@gmail.com>
> > > To: dev@systemml.incubator.apache.org
> > > Date: 12/09/2015 11:15 AM
> > > Subject: Re: Using GLM-predict
> > > ------------------------------
> > >
> > >
> > >
> > > Hi Niketan,
> > >
> > > The code you provided works fine. The use of getMatrixCharacteristics
> > > solves the basic execution problem.
> > >
> > > However, question #3 is probably not yet resolved. Let me explain the
> > > use case scenario I'm trying to build.
> > >
> > > 1. Say I have a data frame (DF1) with a Unique Id (string), a bunch of
> > > columns (say 4) which are to be used as features (double), and a column
> > > for the dependent variable (double).
> > > 2. When I created the model I created a data frame (DF2) from DF1 using
> > > only the feature vectors and passed that as X. And the column with the
> > > dependent value is passed as Y.
> > > 3. For calling GLM-predict I'm using another data frame (DF3) of the
> > > same structure but with different Unique IDs (essentially different
> > > records/rows). From that data frame I'm first creating another data
> > > frame (DF4) containing the columns representing the features. Then I'm
> > > sending DF4, which has only feature vectors, to GLM-predict.
> > > 4. The response I get from GLM-predict is the 'means'. Then I'm using
> > > the inline predict script which returns another data frame (DF5) with ID
> > > and Predicted values.
> > >
> > > The question is how do I correlate the ID I'm getting from DF5 with the
> > > Unique ID of the data frame DF3 ?
> > >
> > > Regards,
> > > Sourav
> > >
> > >
> > >
> > >
> > > On Wed, Dec 9, 2015 at 9:17 AM, Niketan Pansare <npansar@us.ibm.com>
> > > wrote:
> > >
> > > > Hi Sourav,
> > > >
> > > > 1. In the GLM-predict.dml I could see 'means' is the output variable.
> > > > In my understanding it is same as the probability matrix u have
> > > > mentioned in your mail (to be used to compute the prediction). Am I
> > > > right ?
> > > > Yes, that's correct.
> > > >
> > > > 2. From GLM.dml I get the 'betas' as output using
> > > > outputs.getBinaryBlockedRDD("beta_out"). The same I pass to
> > > > GLM-predict.dml as B.
> > > >
> > > > Can you try this ?
> > > > // Get output from GLM
> > > > val beta = outputs.getBinaryBlockedRDD("beta_out")
> > > > // This way you don't have to worry about dimensions.
> > > > val betaMC = outputs.getMatrixCharacteristics("beta_out")
> > > > // -----------------------------------------
> > > > // val Xin = <DataFrame/RDD of values (or even text/csv file) you want to
> > > > // predict>
> > > > // -----------------------------------------
> > > > // Execute GLM-predict
> > > > ml.reset()
> > > > // Please read
> > > > // https://github.com/apache/incubator-systemml/blob/master/scripts/algorithms/GLM.dml
> > > > // dfam Int 1 Distribution family code: 1 = Power, 2 = Binomial
> > > > // family of distribution ?
> > > > val cmdLineParamsPredict = Map("X" -> " ", "B" -> " ", "dfam" -> "...")
> > > > ml.registerInput("X", Xin)
> > > > ml.registerInput("B_full", beta, betaMC)
> > > > ml.registerOutput("means")
> > > > val outputsPredict =
> > > > ml.execute("/home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml",
> > > > cmdLineParamsPredict)
> > > > val prob = outputsPredict.getBinaryBlockedRDD("means")
> > > > val probMC = outputsPredict.getMatrixCharacteristics("means")
> > > > // -----------------------------------------
> > > > // Get predicted label
> > > > ml.reset()
> > > > ml.registerInput("Prob", prob, probMC)
> > > > ml.registerOutput("Prediction")
> > > > val outputsLabels = ml.executeScript("Prob = read(\"temp1\"); "
> > > > + "Prediction = rowIndexMax(Prob); "
> > > > + "write(Prediction, \"tempOut\", \"csv\")")
> > > > val pred = outputsLabels.getDF(sqlContext,
> > > > "Prediction").withColumnRenamed("C1", "prediction")
> > > > // -----------------------------------------
> > > >
> > > >
> > > > 3. Say I get back prediction matrix as an output (from predictions =
> > > > rowIndexMax(means);). Now can I add that as a column to my
> > > > original data frame (the one from which I created the feature vector for
> > > > the original model) ? My concern is whether adding back will ensure the
> > > > right order so that the key for the feature vector and the predicted
> > > > value remain same ? If not how to achieve the same ?
> > > > In the above example 'pred' is a DataFrame with column 'ID' which
> > > > provides the row ID.
> > > >
> > > > Thanks,
> > > >
> > > > Niketan Pansare
> > > > IBM Almaden Research Center
> > > > E-mail: npansar At us.ibm.com
> > > >
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> > > >
> > > >
> > > > From: Sourav Mazumder <sourav.mazumder00@gmail.com>
> > > > To: dev@systemml.incubator.apache.org, Niketan Pansare/Almaden/IBM@IBMUS
> > > > Date: 12/08/2015 10:53 PM
> > > > Subject: Re: Using GLM-predict
> > > > ------------------------------
> > > >
> > > >
> > > >
> > > > Hi Niketan,
> > > >
> > > > Thanks again for the detailed inputs.
> > > >
> > > > Some more follow up Qs -
> > > >
> > > > 1. In the GLM-predict.dml I could see 'means' is the output variable.
> > > > In my understanding it is the same as the probability matrix you
> > > > mentioned in your mail (to be used to compute the prediction). Am I
> > > > right ?
> > > >
> > > > 2. From GLM.dml I get the 'betas' as output using
> > > > outputs.getBinaryBlockedRDD("beta_out"). The same I pass to
> > > > GLM-predict.dml as B. For registering B the following statements are used:
> > > > val beta = outputs.getBinaryBlockedRDD("beta_out")
> > > > // I have four feature vectors so I get 4 coefficients
> > > > ml.registerInput("B", beta, 1, 4)
> > > >
> > > > However, when I execute GLM-predict.dml I get following error.
> > > >
> > > > val outputs =
> > > > ml.execute("/home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml",
> > > > cmdLineParams)
> > > >
> > > > 15/12/09 05:32:47 WARN Expression: Metadata file:  .mtd not provided
> > > > 15/12/09 05:32:47 ERROR Expression: ERROR:
> > > > /home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml -- line 117,
> > > > column 8 -- Missing or incomplete dimension information in read
> > > > statement:  .mtd
> > > > com.ibm.bi.dml.parser.LanguageException: Invalid Parameters : ERROR:
> > > > /home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml -- line 117,
> > > > column 8 -- Missing or incomplete dimension information in read
> > > > statement:  .mtd
> > > >
> > > > In line 117 we have following statement : X = read (fileX);
> > > >
> > > > 3. Say I get back prediction matrix as an output (from predictions =
> > > > rowIndexMax(means);). Now can I add that as a column to my
> > > > original data frame (the one from which I created the feature vector for
> > > > the original model) ? My concern is whether adding back will ensure the
> > > > right order so that the key for the feature vector and the predicted
> > > > value remain same ? If not how to achieve the same ?
> > > >
> > > > Regards,
> > > > Sourav
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Tue, Dec 8, 2015 at 2:08 PM, Niketan Pansare <npansar@us.ibm.com>
> > > > wrote:
> > > >
> > > > > Hi Sourav,
> > > > >
> > > > > For some reason, I didn't get your email on "Tue, 08 Dec 2015
> > > > > 12:56:38 -0800"
> > > > > <https://www.mail-archive.com/search?l=dev@systemml.incubator.apache.org&q=date:20151208>
> > > > > (which I noticed in the archive).
> > > > >
> > > > > >> Not sure how exactly I can modify the GLM-predict.dml to get some
> > > > > >> prediction to start with.
> > > > > There are two options here:
> > > > > 1. Modify GLM-predict.dml as suggested by Shirish (better approach
> > > > > with respect to the SystemML optimizer) or
> > > > >
> > > > > 2. Run a new script on the output of GLM-predict. Please see:
> > > > > https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/api/ml/LogisticRegressionModel.java#L163
> > > > > If you choose to go with option 2, you might also want to read the
> > > > > documentation of the following two built-in functions:
> > > > > a. rowIndexMax (see
> > > > > http://apache.github.io/incubator-systemml/dml-language-reference.html#matrix-andor-scalar-comparison-built-in-functions
> > > > > )
> > > > > b. ppred
> > > > >
> > > > > >> Can you give me some idea how from here I can calculate the
> > > > > >> predicted value of the label using some value of probability
> > > > > >> threshold ?
> > > > > Very simple way to predict the label given probability matrix:
> > > > > Prediction = rowIndexMax(Prob) # predicts the label with highest
> > > > > probability. This assumes one-based labels.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Niketan Pansare
> > > > > IBM Almaden Research Center
> > > > > E-mail: npansar At us.ibm.com
> > > > >
> > http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> > > > >
> > > > >
> > > > > From: Shirish Tatikonda <shirish.tatikonda@gmail.com>
> > > > > To: dev@systemml.incubator.apache.org
> > > > > Date: 12/08/2015 12:49 PM
> > > > > Subject: Re: Using GLM-predict
> > > > > ------------------------------
> > > > >
> > > > >
> > > > >
> > > > > Hi Sourav,
> > > > >
> > > > > Yes, GLM-predict.dml gives out only the probabilities. You can put a
> > > > > threshold on the resulting probabilities to get the actual class
> > > > > labels -- for example, prob > 0.5 is positive and <= 0.5 is negative.
> > > > >
> > > > > The exact value of the threshold typically depends on the data and the
> > > > > application. Different thresholds yield different classifiers with
> > > > > different performance (precision, recall, etc.). You can find the
> > > > > best threshold for the given data set by finding a value that gives the
> > > > > desired classifier performance (for example, a threshold that gives
> > > > > roughly equal precision and recall). Such an optimization is obviously
> > > > > done during the training phase using a held out test set.
> > > > >
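
A rough sketch of such a threshold search in plain Scala, outside of DML. The names `precisionRecall` and `bestThreshold` are hypothetical, and the selection criterion (smallest gap between precision and recall) is just one of the options mentioned above. Treating a score of exactly the threshold as negative follows the "prob > threshold" rule; returning 1.0 when a denominator is zero is an assumption of this sketch.

```scala
object ThresholdSweepSketch {
  // Precision and recall of the rule "prob > threshold means positive".
  def precisionRecall(probs: Seq[Double], labels: Seq[Boolean],
                      threshold: Double): (Double, Double) = {
    val pairs = probs.map(_ > threshold).zip(labels)
    val tp = pairs.count { case (p, l) => p && l }
    val fp = pairs.count { case (p, l) => p && !l }
    val fn = pairs.count { case (p, l) => !p && l }
    // Convention assumed here: an empty denominator yields 1.0.
    val precision = if (tp + fp == 0) 1.0 else tp.toDouble / (tp + fp)
    val recall = if (tp + fn == 0) 1.0 else tp.toDouble / (tp + fn)
    (precision, recall)
  }

  // Pick the candidate threshold whose precision and recall are closest.
  def bestThreshold(probs: Seq[Double], labels: Seq[Boolean],
                    candidates: Seq[Double]): Double =
    candidates.minBy { t =>
      val (p, r) = precisionRecall(probs, labels, t)
      math.abs(p - r)
    }
}
```

In practice `probs` and `labels` would come from a held-out set, and the candidate thresholds could be a grid over (0, 1).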
> > > > > If you wish, you can also modify the DML script to perform this
> > > > > entire process.
> > > > >
> > > > > Shirish
> > > > >
> > > > >
> > > > > On Tue, Dec 8, 2015 at 12:23 PM, Sourav Mazumder <
> > > > > sourav.mazumder00@gmail.com> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I have used GLM.dml to create a model using some sample data. It
> > > > > > returns to me the matrix of Beta, B.
> > > > > >
> > > > > > Now I want to use this matrix of Beta on a new set of data points
> > > > > > and generate predicted value of the dependent variable/observation.
> > > > > >
> > > > > > When I checked GLM-predict, I could see that one can pass feature
> > > > > > vector for the new data set and also the matrix of beta.
> > > > > >
> > > > > > But I could not see any way to get the predicted value of the
> > > > > > dependent variable/observation. The output parameter only supports
> > > > > > matrix of predicted means/probabilities.
> > > > > >
> > > > > > Is there a way one can get the predicted value of the dependent
> > > > > > variable/observation from GLM-predict ?
> > > > > >
> > > > > > Regards,
> > > > > > Sourav
> > > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > > >
> > >
> > >
> > >
> >
> >
> >
>
>
>
