spark-issues mailing list archives

From "Samuel Alexander (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-14037) count(df) is very slow for dataframe constructed using SparkR::createDataFrame
Date Fri, 01 Apr 2016 05:31:25 GMT

    [ https://issues.apache.org/jira/browse/SPARK-14037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15221190#comment-15221190
] 

Samuel Alexander commented on SPARK-14037:
------------------------------------------

Sorry to say it doesn't improve a lot, though there is some improvement: count(df)
now takes around 27 seconds (about 3 seconds less).

And here are the metrics you may require:

16/04/01 10:58:13 INFO RRDD: Times: boot = 0.662 s, init = 0.005 s, broadcast = 0.000 s, read-input
= 0.354 s, compute = 0.132 s, write-output = 25.195 s, total = 26.348 s


> count(df) is very slow for dataframe constructed using SparkR::createDataFrame
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-14037
>                 URL: https://issues.apache.org/jira/browse/SPARK-14037
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 1.6.1
>         Environment: Ubuntu 12.04
> RAM : 6 GB
> Spark 1.6.1 Standalone
>            Reporter: Samuel Alexander
>              Labels: performance, sparkR
>         Attachments: console.log, spark_ui.png, spark_ui_ray.png
>
>
> Any operation on a DataFrame created using SparkR::createDataFrame is very slow.
> I have a CSV of size ~6 MB. Below is sample content:
> 12121212Juej1XC,A_String,5460.8,2016-03-14,7,Quarter
> 12121212K6sZ1XS,A_String,0.0,2016-03-14,7,Quarter
> 12121212K9Xc1XK,A_String,7803.0,2016-03-14,7,Quarter
> 12121212ljXE1XY,A_String,226944.25,2016-03-14,7,Quarter
> 12121212lr8p1XA,A_String,368022.26,2016-03-14,7,Quarter
> 12121212lwip1XA,A_String,84091.0,2016-03-14,7,Quarter
> 12121212lwkn1XA,A_String,54154.0,2016-03-14,7,Quarter
> 12121212lwlv1XA,A_String,11219.09,2016-03-14,7,Quarter
> 12121212lwmL1XQ,A_String,23808.0,2016-03-14,7,Quarter
> 12121212lwnj1XA,A_String,32029.3,2016-03-14,7,Quarter
> I created an R data.frame using r_df <- read.csv(file="r_df.csv", header=TRUE, sep=","),
> and then converted it into a Spark DataFrame using sp_df <- createDataFrame(sqlContext, r_df).
> Now count(sp_df) took more than 30 seconds.
> When I load the same CSV directly through spark-csv, like direct_df <- read.df(sqlContext, "/home/sam/tmp/csv/orig_content.csv",
source = "com.databricks.spark.csv", inferSchema = "false", header = "true"),
> count(direct_df) took less than 1 second.
> I know createDataFrame performance was improved in Spark 1.6, but other operations,
like count(), are still very slow.
> How can I get rid of this performance issue?
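
For reference, a minimal sketch of one possible workaround (not from the report; the file path is illustrative): round-tripping the data.frame through a CSV on disk lets Spark's JVM read the data directly via spark-csv, sidestepping the R-to-JVM transfer that the dominant write-output time (25.195 s of 26.348 s) in the metrics above suggests is the bottleneck. write.csv, read.df, cache, and count are standard R / SparkR 1.6 calls; this assumes a running sqlContext.

```r
# Illustrative workaround sketch (SparkR 1.6, running sqlContext assumed).
# Write the in-memory R data.frame to disk, let Spark read it natively,
# and cache so repeated actions do not redo the load.
write.csv(r_df, file = "/tmp/r_df_roundtrip.csv", row.names = FALSE)
sp_df <- read.df(sqlContext, "/tmp/r_df_roundtrip.csv",
                 source = "com.databricks.spark.csv", header = "true")
sp_df <- cache(sp_df)
count(sp_df)  # first count materializes the cache; later actions are faster
```

This trades a disk round trip for avoiding serialization of the whole data.frame from the R process on every job.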



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

