spark-issues mailing list archives

From "Manish Kumar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-16169) Saving Intermediate dataframe increasing processing time upto 5 times.
Date Fri, 24 Jun 2016 09:14:16 GMT

    [ https://issues.apache.org/jira/browse/SPARK-16169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348043#comment-15348043 ]

Manish Kumar commented on SPARK-16169:
--------------------------------------

Even if our code were asking Spark to do more work, some tasks should then be in running status; instead, all the tasks and jobs complete within the first 10 minutes, yet the application keeps running for around 50 minutes. That is what is shown in the attached screenshot of the Spark UI.

In the attached Spark UI screenshot you can see that all the jobs reach completed status within the first 10 minutes of execution, yet the total execution time of the Spark application is 50 minutes.

> Saving Intermediate dataframe increasing processing time upto 5 times.
> ----------------------------------------------------------------------
>
>                 Key: SPARK-16169
>                 URL: https://issues.apache.org/jira/browse/SPARK-16169
>             Project: Spark
>          Issue Type: Question
>          Components: Spark Submit, Web UI
>    Affects Versions: 1.6.1
>         Environment: Amazon EMR
>            Reporter: Manish Kumar
>              Labels: performance
>         Attachments: Spark-UI.png
>
>
> When a Spark application (written in Scala) tries to save an intermediate DataFrame, the processing time increases by almost 5 times.
> Although the Spark UI clearly shows that all jobs are completed, the Spark application remains in running status.
> Below is the command for saving the intermediate output and then using the DataFrame.
> {noformat}
> saveDataFrame(flushPath, flushFormat, isCoalesce, flushMode, previousDataFrame, sqlContext)
> previousDataFrame.count
> {noformat}
> Here, previousDataFrame is the result of the previous step, and saveDataFrame simply saves the DataFrame to the given location; previousDataFrame is then used by the next steps/transformations.
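> The body of saveDataFrame is not included in the issue; the following is a minimal sketch, assuming it only coalesces when requested and writes the DataFrame with the given format and save mode. The signature mirrors the call above, but the parameter names and behaviour are otherwise hypothetical.
> {noformat}
> import org.apache.spark.sql.{DataFrame, SQLContext}
>
> // Hypothetical helper matching the call site above; the real implementation
> // is not shown in the issue. It optionally coalesces to a single partition
> // and writes the DataFrame using the requested format and save mode.
> // sqlContext is accepted only to mirror the call site and is unused here.
> def saveDataFrame(
>     path: String,
>     format: String,
>     isCoalesce: Boolean,
>     mode: String,
>     df: DataFrame,
>     sqlContext: SQLContext): Unit = {
>   val toWrite = if (isCoalesce) df.coalesce(1) else df
>   toWrite.write.format(format).mode(mode).save(path)
> }
> {noformat}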

> Below is the Spark UI screenshot, which shows the jobs as completed even though some tasks inside them are neither completed nor skipped.
> !Spark-UI.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

