systemml-dev mailing list archives

From Sourav Mazumder <sourav.mazumde...@gmail.com>
Subject Re: Using GLM with Spark
Date Tue, 08 Dec 2015 15:10:50 GMT
Hi Sirish,

This is cool.

I typically achieve the same thing using Spark MLlib utilities like
VectorAssembler, together with the DataFrame API.
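
For example, a minimal sketch of that approach (the DataFrame df and the
column names here are hypothetical):

import org.apache.spark.ml.feature.VectorAssembler

// df is the assumed input DataFrame with feature columns and a label column
// combine the individual feature columns into a single vector column
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2", "f3"))   // hypothetical feature columns
  .setOutputCol("features")
val xDF = assembler.transform(df).select("features")
val yDF = df.select("label")               // hypothetical label column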

So this brings up a question: when should one use a DML script for this type
of data preparation, and when should one use the libraries already available
in the existing platform? Any suggestions?

Regards,
Sourav

On Tue, Dec 8, 2015 at 12:29 AM, Shirish Tatikonda <
shirish.tatikonda@gmail.com> wrote:

> Hi Sourav,
>
> Just to add to Niketan's response, you can find a utility DML script to
> split a data set into X and Y at [1]. This obviously is useful only if you
> have one unified data set with both X and Y.
>
> [1]
>
> https://github.com/apache/incubator-systemml/blob/master/scripts/utils/splitXY.dml
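>
> For instance, a hypothetical MLContext invocation of that script (the
> argument names below are taken from the script's header comments; please
> verify them against the linked script):
>
> val ml = new MLContext(sc)
> // y is the 1-based index of the label column; OX/OY are the output paths
> val params = Map("X" -> "INPUT_DIR/data", "y" -> "51",
>   "OX" -> "OUTPUT_DIR/X", "OY" -> "OUTPUT_DIR/Y", "ofmt" -> "csv")
> ml.execute("splitXY.dml", params)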
>
> Shirish
> On Dec 7, 2015 11:11 PM, "Niketan Pansare" <npansar@us.ibm.com> wrote:
>
> > Hi Sourav,
> >
> > Your understanding is correct: X and Y can be supplied either as a file
> > or as an RDD/DataFrame. Each of these two mechanisms has its own
> > benefits. The former mechanism (i.e., passing a file) pushes the
> > reading/reblocking into the optimizer, while the latter mechanism
> > allows for preprocessing of the data (for example, using Spark SQL).
> >
> > Two use-cases when X and Y are supplied as files on HDFS:
> > 1. Command-line invocation: $SPARK_HOME/bin/spark-submit --master ....
> > SystemML.jar -f GLM.dml -nvargs dfam=2 link=2 yneg=-1.0 icpt=2 reg=0.001
> > tol=0.00000001 disp=1.0 moi=100 mii=10 X=INPUT_DIR/X Y=INPUT_DIR/Y
> > B=OUTPUT_DIR/betas fmt=csv O=OUTPUT_DIR/stats Log=OUTPUT_DIR/log
> >
> > 2. Using MLContext, but without registering X and Y as inputs. Instead,
> > we pass the filenames as command-line parameters:
> > > val ml = new MLContext(sc)
> > > val cmdLineParams = Map("X" -> "INPUT_DIR/X", "Y" -> "INPUT_DIR/Y",
> > >   "dfam" -> "2", "link" -> "2", ...)
> > > ml.execute("GLM.dml", cmdLineParams)
> >
> > As mentioned earlier, X and Y can be provided as an RDD/DataFrame as well:
> > > val ml = new MLContext(sc)
> > > ml.registerInput("X", xDF)
> > > ml.registerInput("Y", yDF)
> > > val cmdLineParams = Map("X" -> " ", "Y" -> " ", "B" -> " ",
> > >   "dfam" -> "2", "link" -> "2", ...)
> > > ml.execute("GLM.dml", cmdLineParams)
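> >
> > As an aside, the preprocessing mentioned above could be done with Spark
> > SQL before registering the result (a sketch; the table and column names
> > are hypothetical, and "mydata" is assumed to be a registered temp table):
> > > // select/filter with Spark SQL, then hand the result to SystemML
> > > val xDF = sqlContext.sql("SELECT f1, f2, f3 FROM mydata WHERE f1 IS NOT NULL")
> > > ml.registerInput("X", xDF)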
> >
> > One important thing that I must point out is the concept of "ifdef". It
> > is explained in this section:
> > http://apache.github.io/incubator-systemml/dml-language-reference.html#command-line-arguments
> >
> > Here is a snippet from the DML script for GLM:
> > https://github.com/apache/incubator-systemml/blob/master/scripts/algorithms/GLM.dml
> > fileX = $X;
> > fileY = $Y;
> > fileO = ifdef ($O, " ");
> > fmtB = ifdef ($fmt, "text");
> > distribution_type = ifdef ($dfam, 1);
> >
> > The above DML code essentially says that $X and $Y are required
> > parameters (a design decision that the GLM script writer made), whereas
> > $fmt and $dfam are optional, as they are assigned default values when
> > not explicitly provided. Both of these constructs are important tools in
> > the arsenal of a DML script writer. By not guarding a dollar parameter
> > with ifdef, the DML script writer ensures that the user has to provide
> > its value (in this case, the file names for X and Y). This is why you
> > will notice that I have provided a space for X, Y and B in the second
> > MLContext snippet.
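> >
> > Concretely: because $fmt and $dfam are guarded by ifdef, they can simply
> > be omitted, while the unguarded $X, $Y and $B must still be given
> > (placeholder) values. A sketch, restating the snippet above:
> > > // $fmt and $dfam are omitted; they fall back to "text" and 1
> > > // $X, $Y and $B are unguarded, so placeholders are still required
> > > val cmdLineParams = Map("X" -> " ", "Y" -> " ", "B" -> " ")
> > > ml.execute("GLM.dml", cmdLineParams)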
> >
> > Thanks,
> >
> > Niketan Pansare
> > IBM Almaden Research Center
> > E-mail: npansar At us.ibm.com
> > http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> >
> >
> > From: Sourav Mazumder <sourav.mazumder00@gmail.com>
> > To: dev@systemml.incubator.apache.org
> > Date: 12/07/2015 07:30 PM
> > Subject: Using GLM with Spark
> > ------------------------------
> >
> >
> >
> > Hi,
> >
> > Trying to use GLM with Spark.
> >
> > I went through its documentation at
> > http://apache.github.io/incubator-systemml/algorithms-regression.html#generalized-linear-models
> > and I see that inputs like X and Y have to be supplied using a file, and
> > the file has to be on HDFS.
> >
> > Is this understanding correct? Can't X and Y be supplied using a
> > DataFrame from a SparkContext (as in the SystemML Linear Regression
> > example at
> > http://apache.github.io/incubator-systemml/mlcontext-programming-guide.html#train-using-systemml-linear-regression-algorithm
> > )?
> >
> > Regards,
> > Sourav
> >
> >
> >
>
