Subject: Re: Using GLM with Spark
From: Sourav Mazumder
To: dev@systemml.incubator.apache.org
Date: Tue, 8 Dec 2015 07:05:52 -0800

Hi Niketan,

Thanks again for the detailed clarification and example.

I suggest stating explicitly in the documentation that X and Y can be passed as a DataFrame/RDD when running on Spark. This is not very clear from the documentation. Right now it gives the impression that a Hadoop cluster is needed to execute this, whereas I am looking for end-to-end execution of SystemML using only Spark (without using Hadoop at all).

My next questions are:

a) How do I get B back after I execute GLM in Spark (ml.execute())? I need to use it as input to GLM-predict in order to apply the model, and I don't want to incur additional I/O. Can I use something like ml.get(), which would return B in matrix form?

b) What is the purpose of the cmdLineParams parameter? If I am already supplying X and Y, the mandatory parameters, why do I need to pass this parameter as well?

Regards,
Sourav

On Mon, Dec 7, 2015 at 11:11 PM, Niketan Pansare wrote:

> Hi Sourav,
>
> Your understanding is correct: X and Y can be supplied either as a file or
> as an RDD/DataFrame. Each of these two mechanisms has its own benefits. The
> former mechanism (i.e. passing as a file) pushes the reading/reblocking into
> the optimizer, while the latter mechanism allows for preprocessing of the
> data (for example, using Spark SQL).
>
> Two use cases when X and Y are supplied as files on HDFS:
>
> 1. Command-line invocation:
>
>    $SPARK_HOME/bin/spark-submit --master .... SystemML.jar -f GLM.dml \
>      -nvargs dfam=2 link=2 yneg=-1.0 icpt=2 reg=0.001 tol=0.00000001 \
>      disp=1.0 moi=100 mii=10 X=INPUT_DIR/X Y=INPUT_DIR/Y \
>      B=OUTPUT_DIR/betas fmt=csv O=OUTPUT_DIR/stats Log=OUTPUT_DIR/log
>
> 2. Using MLContext but without registering X and Y as input. Instead we
>    pass filenames as command-line parameters:
>
>    val ml = new MLContext(sc)
>    val cmdLineParams = Map("X" -> "INPUT_DIR/X", "Y" -> "INPUT_DIR/Y", "dfam" -> "2", "link" -> "2", ...)
>    ml.execute("GLM.dml", cmdLineParams)
>
> As mentioned earlier, X and Y can be provided as an RDD/DataFrame as well:
>
>    val ml = new MLContext(sc)
>    ml.registerInput("X", xDF)
>    ml.registerInput("Y", yDF)
>    val cmdLineParams = Map("X" -> " ", "Y" -> " ", "B" -> " ", "dfam" -> "2", "link" -> "2", ...)
>    ml.execute("GLM.dml", cmdLineParams)
>
> One important thing that I must point out is the concept of "ifdef". It is
> explained in the section
> http://apache.github.io/incubator-systemml/dml-language-reference.html#command-line-arguments.
>
> Here is a snippet from the DML script for GLM:
> https://github.com/apache/incubator-systemml/blob/master/scripts/algorithms/GLM.dml
>
>    fileX = $X;
>    fileY = $Y;
>    fileO = ifdef ($O, " ");
>    fmtB = ifdef ($fmt, "text");
>    distribution_type = ifdef ($dfam, 1);
>
> The above DML code essentially says that $X and $Y are required parameters
> (a design decision that the GLM script writer made), whereas $fmt and $dfam
> are optional, as they are assigned default values when not explicitly
> provided. Both of these constructs are important tools in the arsenal of a
> DML script writer. By not guarding a dollar parameter with ifdef, the DML
> script writer ensures that the user has to provide its value (in this case,
> file names for X and Y). This is why you will notice that I have provided a
> space for X, Y and B in the second MLContext snippet.
>
> Thanks,
>
> Niketan Pansare
> IBM Almaden Research Center
> E-mail: npansar At us.ibm.com
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
>
> From: Sourav Mazumder
> To: dev@systemml.incubator.apache.org
> Date: 12/07/2015 07:30 PM
> Subject: Using GLM with Spark
> ------------------------------
>
> Hi,
>
> I am trying to use GLM with Spark.
>
> I went through the documentation for it at
> http://apache.github.io/incubator-systemml/algorithms-regression.html#generalized-linear-models
> and I see that inputs like X and Y have to be supplied using a file, and
> that the file has to be in HDFS.
>
> Is this understanding correct? Can't X and Y be supplied using a DataFrame
> from a Spark context (as in the LinearRegression example at
> http://apache.github.io/incubator-systemml/mlcontext-programming-guide.html#train-using-systemml-linear-regression-algorithm)?
>
> Regards,
> Sourav
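
The "ifdef" behaviour described in the quoted reply can be illustrated with a minimal sketch in plain Scala. This is not SystemML code and does not call MLContext: `required` and `ifdef` are hypothetical helpers that only mimic how a bare `$X` and a guarded `ifdef ($fmt, "text")` would resolve against the cmdLineParams map, assuming (as the reply states) that an unguarded dollar parameter must be supplied by the user.

```scala
// Hypothetical model of DML "$param" and ifdef() resolution; not SystemML code.
object IfdefSketch {

  // Models a bare "$name" in a DML script: the caller must supply a value.
  def required(args: Map[String, String], name: String): String =
    args.getOrElse(name,
      throw new IllegalArgumentException("missing required parameter: $" + name))

  // Models "ifdef($name, default)": fall back to the default when absent.
  def ifdef(args: Map[String, String], name: String, default: String): String =
    args.getOrElse(name, default)

  def main(argv: Array[String]): Unit = {
    // Same shape as the second MLContext snippet: X, Y and B get a
    // placeholder space because the inputs are registered separately.
    // Further GLM parameters are omitted here.
    val cmdLineParams =
      Map("X" -> " ", "Y" -> " ", "B" -> " ", "dfam" -> "2", "link" -> "2")

    val fileX = required(cmdLineParams, "X")        // present, so no error
    val fmtB  = ifdef(cmdLineParams, "fmt", "text") // absent, default is used
    val dfam  = ifdef(cmdLineParams, "dfam", "1")   // present, default ignored

    println(s"fileX='$fileX' fmtB=$fmtB dfam=$dfam")
  }
}
```

In this model, dropping "X" from the map makes `required` throw, which mirrors why the placeholder space is passed for X, Y and B even when the data comes from registered inputs.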