systemml-dev mailing list archives

From "Niketan Pansare" <npan...@us.ibm.com>
Subject Re: Improve SystemML execution speed in Spark
Date Wed, 10 May 2017 20:03:15 GMT
Hi Arijit,

Could you please put timing counters around the code below to see where the 20-30 seconds you observe are being spent?
1. Creation of SparkContext: 
sc = SparkContext("local[*]", "test")
2. Converting the pandas DataFrames to PySpark DataFrames:
> train_data= pd.read_csv("data1.csv")
> test_data     = pd.read_csv("data2.csv")
> train_data = sqlCtx.createDataFrame(pd.DataFrame(train_data))
> test_data  = sqlCtx.createDataFrame(pd.DataFrame(test_data))
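
One way to place such counters is sketched below. The `timed` helper, the `timings` dictionary, and the labels are illustrative names, not part of PySpark or SystemML, and the `time.sleep` calls are stand-ins so the sketch runs without Spark; in the real script each stand-in would be replaced by the step named in its label.

```python
import time
from contextlib import contextmanager

timings = {}  # label -> elapsed wall-clock seconds


@contextmanager
def timed(label, store=timings):
    """Record the wall-clock time of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        store[label] = time.perf_counter() - start


# Stand-in workloads (assumption: replace each body with the real step), e.g.
#   sc = SparkContext("local[*]", "test")
#   train_data = sqlCtx.createDataFrame(train_data)
with timed("create SparkContext"):
    time.sleep(0.01)

with timed("pandas -> Spark DataFrame"):
    time.sleep(0.01)

# Report each step so the slow one stands out.
for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f} s")
```

Wrapping the `ml.execute(script)` call in a third `timed` block would show how much of the total is script execution rather than setup.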


Also, you can pass a pandas DataFrame directly to MLContext :)

Thanks 

Niketan 

> On May 10, 2017, at 10:31 AM, arijit chakraborty <akc14@hotmail.com> wrote:
> 
> Hi,
> 
> 
> I'm creating a process in SystemML and running it through Spark. I'm running the code in the following way:
> 
> 
> # Spark Specifications:
> 
> 
> import os
> import sys
> import pandas as pd
> import numpy as np
> 
> spark_path = r"C:\spark"  # raw string so the backslash is not treated as an escape
> os.environ['SPARK_HOME'] = spark_path
> os.environ['HADOOP_HOME'] = spark_path
> 
> sys.path.append(spark_path + "/bin")
> sys.path.append(spark_path + "/python")
> sys.path.append(spark_path + "/python/pyspark/")
> sys.path.append(spark_path + "/python/lib")
> sys.path.append(spark_path + "/python/lib/pyspark.zip")
> sys.path.append(spark_path + "/python/lib/py4j-0.10.4-src.zip")
> 
> from pyspark import SparkContext
> from pyspark import SparkConf
> 
> sc = SparkContext("local[*]", "test")
> 
> 
> # SystemML Specifications:
> 
> 
> from pyspark.sql import SQLContext
> import systemml as sml
> sqlCtx = SQLContext(sc)
> ml = sml.MLContext(sc)
> 
> 
> # Importing the data
> 
> 
> train_data= pd.read_csv("data1.csv")
> test_data     = pd.read_csv("data2.csv")
> 
> 
> 
> train_data = sqlCtx.createDataFrame(pd.DataFrame(train_data))
> test_data  = sqlCtx.createDataFrame(pd.DataFrame(test_data))
> 
> 
> # Finally executing the code:
> 
> 
> scriptUrl = "C:/systemml-0.13.0-incubating-bin/scripts/model_code.dml"
> 
> script = sml.dml(scriptUrl).input(bdframe_train=train_data, bdframe_test=test_data).output("check_func")
> 
> beta = ml.execute(script).get("check_func").toNumPy()
> 
> pd.DataFrame(beta).head(1)
> 
> The data sizes are 1000 and 100 rows for train and test respectively. I'm testing on a small dataset during development and will test on a larger dataset later. I'm running on my local system with 4 cores.
> 
> The problem is that if I run the model in R, it takes a fraction of a second, but when I run it like this, it takes around 20-30 seconds.
> 
> Could anyone please suggest how to improve the execution speed, or whether there is another way to execute the code that would be faster?
> 
> Also, thank you all for releasing the 0.14 version. There are a few improvements we found extremely helpful.
> 
> Thank you!
> Arijit
> 

