systemml-dev mailing list archives

From arijit chakraborty <ak...@hotmail.com>
Subject Re: Improve SystemML execution speed in Spark
Date Thu, 11 May 2017 20:27:21 GMT
Thank you, Niketan, for your reply! I had actually put the timer around the dml code part.
The rest of the steps were almost instantaneous; it was the dml part that took the time,
and I could not figure out why.


Thanks again!

Arijit

________________________________
From: Niketan Pansare <npansar@us.ibm.com>
Sent: Thursday, May 11, 2017 1:33:15 AM
To: dev@systemml.incubator.apache.org
Subject: Re: Improve SystemML execution speed in Spark

Hi Arijit,

Can you please put timing counters around the code below, to understand where the 20-30 seconds you observe are spent:
1. Creation of SparkContext:
sc = SparkContext("local[*]", "test")
2. Converting pandas to Pyspark dataframe:
> train_data= pd.read_csv("data1.csv")
> test_data     = pd.read_csv("data2.csv")
> train_data = sqlCtx.createDataFrame(pd.DataFrame(train_data))
> test_data  = sqlCtx.createDataFrame(pd.DataFrame(test_data))
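For example, a small timing helper can attribute the seconds (a sketch using only the Python standard library; the commented-out lines show where it would wrap the Spark calls from the script above, and are not runnable on their own):

```python
# Minimal timing helper, standard library only.
import time
from contextlib import contextmanager

@contextmanager
def timer(label):
    """Print how long the wrapped block took."""
    start = time.perf_counter()
    yield
    print("%s: %.3f s" % (label, time.perf_counter() - start))

# In the script it would wrap each suspected step, e.g.:
# with timer("SparkContext creation"):
#     sc = SparkContext("local[*]", "test")
# with timer("pandas -> Spark DataFrame"):
#     train_data = sqlCtx.createDataFrame(train_data)
# with timer("ml.execute"):
#     beta = ml.execute(script).get("check_func").toNumPy()

# Self-contained demonstration on a cheap stand-in workload:
with timer("sum of one million ints"):
    total = sum(range(1000000))
```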


Also, you can pass a pandas data frame directly to MLContext :)
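Concretely, that would look something like the sketch below (the `.input()`/`.execute()` lines are commented out because they need a live SparkContext, the systemml package, and the DML script; variable names mirror the original post):

```python
import pandas as pd

# Tiny stand-in frames; the original script reads data1.csv / data2.csv.
train_data = pd.DataFrame({"x": [1.0, 2.0], "y": [0.0, 1.0]})
test_data = pd.DataFrame({"x": [3.0], "y": [1.0]})

# With SystemML's Python bindings, the pandas frames can go straight into
# .input(), dropping the sqlCtx.createDataFrame(...) conversion step:
# script = sml.dml(scriptUrl).input(bdframe_train=train_data,
#                                   bdframe_test=test_data).output("check_func")
# beta = ml.execute(script).get("check_func").toNumPy()
```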

Thanks

Niketan

> On May 10, 2017, at 10:31 AM, arijit chakraborty <akc14@hotmail.com> wrote:
>
> Hi,
>
>
> I'm creating a process in SystemML, and running it through spark. I'm running the code
in the following way:
>
>
> # Spark Specifications:
>
>
> import os
> import sys
> import pandas as pd
> import numpy as np
>
> spark_path = r"C:\spark"  # raw string avoids backslash-escape surprises
> os.environ['SPARK_HOME'] = spark_path
> os.environ['HADOOP_HOME'] = spark_path
>
> sys.path.append(spark_path + "/bin")
> sys.path.append(spark_path + "/python")
> sys.path.append(spark_path + "/python/pyspark/")
> sys.path.append(spark_path + "/python/lib")
> sys.path.append(spark_path + "/python/lib/pyspark.zip")
> sys.path.append(spark_path + "/python/lib/py4j-0.10.4-src.zip")
>
> from pyspark import SparkContext
> from pyspark import SparkConf
>
> sc = SparkContext("local[*]", "test")
>
>
> # SystemML Specifications:
>
>
> from pyspark.sql import SQLContext
> import systemml as sml
> sqlCtx = SQLContext(sc)
> ml = sml.MLContext(sc)
>
>
> # Importing the data
>
>
> train_data = pd.read_csv("data1.csv")
> test_data  = pd.read_csv("data2.csv")
>
>
>
> train_data = sqlCtx.createDataFrame(pd.DataFrame(train_data))
> test_data  = sqlCtx.createDataFrame(pd.DataFrame(test_data))
>
>
> # Finally executing the code:
>
>
> scriptUrl = "C:/systemml-0.13.0-incubating-bin/scripts/model_code.dml"
>
> script = sml.dml(scriptUrl).input(bdframe_train =train_data , bdframe_test = test_data).output("check_func")
>
> beta = ml.execute(script).get("check_func").toNumPy()
>
> pd.DataFrame(beta).head(1)
>
> The data sizes are 1000 and 100 rows for train and test respectively. I'm testing it
on a small dataset during development and will test on a larger dataset later. I'm running
on my local system with 4 cores.
>
> The problem is, if I run the model in R, it takes a fraction of a second. But when I
run it like this, it takes around 20-30 seconds.
>
> Could anyone please suggest how to improve the execution speed, or whether there is
any other way I could execute the code that would run faster?
>
> Also, thank you all for releasing the 0.14 version. There are a few improvements
we found extremely helpful.
>
> Thank you!
> Arijit
>

