systemml-dev mailing list archives

From Sourav Mazumder <sourav.mazumde...@gmail.com>
Subject Re: Using GLM with Spark
Date Tue, 08 Dec 2015 15:05:52 GMT
Hi Niketan,

Thanks a lot again for detailed clarification and example.

I do suggest mentioning explicitly in the documentation that X and y can be
passed as a DataFrame/RDD in the case of Spark. It is not very clear from the
documentation as it stands. Right now it gives the impression that a Hadoop
cluster is needed to execute this, whereas I'm looking for an end-to-end
execution of SystemML using only Spark (without using Hadoop at all).

The next questions I have are:

a) How do I get back B after I execute GLM in Spark (ml.execute())? I need
to use it as an input to GLM-predict to apply the model, and I don't want to
incur additional I/O. Is there something like ml.get() that would return B
in matrix form?
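
For instance, something along these lines is what I have in mind (loosely
modeled on the registerOutput/getDF pattern in the MLContext programming
guide; ml.get() above, the output variable name "B", and the exact calls
below are my guesses, not verified API):

val ml = new MLContext(sc)
ml.registerInput("X", xDF)
ml.registerInput("Y", yDF)
ml.registerOutput("B")                          // ask MLContext to keep B in memory
val out = ml.execute("GLM.dml", cmdLineParams)
val bDF = out.getDF(sqlContext, "B")            // B back as a DataFrame, no extra file I/O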

b) What is the use of the cmdLineParams parameter? If I am supplying X and
y, the mandatory parameters, anyway, why do I need to pass this parameter
again?

Regards,
Sourav


On Mon, Dec 7, 2015 at 11:11 PM, Niketan Pansare <npansar@us.ibm.com> wrote:

> Hi Sourav,
>
> Your understanding is correct: X and Y can be supplied either as a file or
> as an RDD/DataFrame. Each of these two mechanisms has its own benefits. The
> former (passing a file) pushes the reading/reblocking into the optimizer,
> while the latter allows for preprocessing of the data (for example, using
> Spark SQL).
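>
> As a sketch of the latter path (the DataFrame df, its columns, and the
> filter are hypothetical, just to illustrate the preprocessing idea; the
> resulting xDF/yDF can then be registered as inputs, as shown further below):
> > val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> > df.registerTempTable("points")  // df: an existing DataFrame of features plus a label column
> > val xDF = sqlContext.sql("SELECT f1, f2, f3 FROM points WHERE label IS NOT NULL")
> > val yDF = sqlContext.sql("SELECT label FROM points WHERE label IS NOT NULL")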
>
> Two use-cases when X and Y are supplied as files on HDFS:
> 1. Command-line invocation: $SPARK_HOME/bin/spark-submit --master ....
> SystemML.jar -f GLM.dml -nvargs dfam=2 link=2 yneg=-1.0 icpt=2 reg=0.001
> tol=0.00000001 disp=1.0 moi=100 mii=10 X=INPUT_DIR/X Y=INPUT_DIR/Y
> B=OUTPUT_DIR/betas fmt=csv O=OUTPUT_DIR/stats Log=OUTPUT_DIR/log
>
> 2. Using MLContext but without registering X and Y as input. Instead we
> pass filenames as command-line parameters:
> > val ml = new MLContext(sc)
> > val cmdLineParams = Map("X"->"INPUT_DIR/X", "Y"- > "INPUT_DIR/Y", "dfam"
> -> "2", "link" -> "2", ...)
> > ml.execute("GLM.dml", cmdLineParams)
>
> As mentioned earlier, X and Y can be provided as RDD/DataFrame as well.
> > val ml = new MLContext(sc)
> > ml.registerInput("X", xDF)
> > ml.registerInput("Y", yDF)
> > val cmdLineParams = Map("X"->" ", "Y"- > " ", "B" -> " ", "dfam" ->
"2",
> "link" -> "2", ...)
> > ml.execute("GLM.dml", cmdLineParams)
>
> One important thing that I must point out is the concept of "ifdef". It is
> explained in the section
> http://apache.github.io/incubator-systemml/dml-language-reference.html#command-line-arguments.
>
> Here is a snippet from the DML script for GLM:
> https://github.com/apache/incubator-systemml/blob/master/scripts/algorithms/GLM.dml
> fileX = $X;
> fileY = $Y;
> fileO = ifdef ($O, " ");
> fmtB = ifdef ($fmt, "text");
> distribution_type = ifdef ($dfam, 1);
>
> The above DML code essentially says that $X and $Y are required parameters
> (a design decision that the GLM script writer made), whereas $fmt and $dfam
> are optional, as they are assigned default values when not explicitly
> provided. Both constructs are important tools in the arsenal of a DML
> script writer. By not guarding a dollar parameter with ifdef, the script
> writer ensures that the user has to provide its value (in this case, file
> names for X and Y). This is why you will notice that I have provided a
> space for X, Y, and B in the second MLContext snippet.
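>
> So in the second snippet you could, in principle, drop the optional keys
> entirely and let the ifdef defaults kick in. A sketch (assuming X and Y
> were registered as above, and assuming all the remaining GLM parameters
> have ifdef defaults):
> > val minimalParams = Map("X" -> " ", "Y" -> " ", "B" -> " ")
> > ml.execute("GLM.dml", minimalParams)  // $fmt, $dfam, ... fall back to their defaults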
>
> Thanks,
>
> Niketan Pansare
> IBM Almaden Research Center
> E-mail: npansar At us.ibm.com
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
>
>
> From: Sourav Mazumder <sourav.mazumder00@gmail.com>
> To: dev@systemml.incubator.apache.org
> Date: 12/07/2015 07:30 PM
> Subject: Using GLM with Spark
> ------------------------------
>
>
>
> Hi,
>
> Trying to use GLM with Spark.
>
> I went through its documentation at
>
> http://apache.github.io/incubator-systemml/algorithms-regression.html#generalized-linear-models
> and I see that inputs like X and Y have to be supplied using a file, and
> the file has to be in HDFS.
>
> Is this understanding correct? Can't X and Y be supplied using a DataFrame
> from a Spark context (as in the LinearRegression example in
>
> http://apache.github.io/incubator-systemml/mlcontext-programming-guide.html#train-using-systemml-linear-regression-algorithm
> )?
>
> Regards,
> Sourav
>
>
>
