spark-issues mailing list archives

From "Vincent Ohprecio (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-14031) Dataframe to csv IO, system performance enters high CPU state and write operation takes 1 hour to complete
Date Mon, 21 Mar 2016 06:29:25 GMT

    [ https://issues.apache.org/jira/browse/SPARK-14031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203605#comment-15203605 ]

Vincent Ohprecio edited comment on SPARK-14031 at 3/21/16 6:28 AM:
-------------------------------------------------------------------

./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.4.0 --driver-memory 4g --executor-memory 4g --driver-java-options "-XX:MaxPermSize=1024m"

With these settings, Stage 5 still took 56 minutes. I attached VisualVM: heap usage stayed under 2 GB and PermGen stayed under 200 MB.

Code Output: https://gist.github.com/bigsnarfdude/29518daffe4ed77dc5d4


was (Author: vohprecio):
./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.4.0 --driver-memory 4g --executor-memory
4g --driver-java-options "-XX:MaxPermSize=1024m"

Stage 5 continued to be 56 minutes. I attached VisualVM and Heap stayed under 2g, and PermGem
stayed under 200mb

Code Output: https://gist.github.com/bigsnarfdude/29518daffe4ed77dc5d4

> Dataframe to csv IO, system performance enters high CPU state and write operation takes 1 hour to complete
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-14031
>                 URL: https://issues.apache.org/jira/browse/SPARK-14031
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Shell
>    Affects Versions: 2.0.0
>         Environment: MACOSX 10.11.2 Macbook Pro 16g - 2.2 GHz Intel Core i7 -1TB and Ubuntu14.04 Vagrant 4 Cores 8g
>            Reporter: Vincent Ohprecio
>            Priority: Minor
>         Attachments: visualVMscreenshot.png
>
>
> Summary
> When using spark-assembly-2.0.0/spark-shell to write the results of a dataframe to csv, the system enters a high CPU state and the write operation takes 1 hour to complete.
> * Affecting: [Stage 5:>  (0 + 2) / 21]
> * Stage 5 elapsed time 3488272270000ns (about 58 minutes)
> In comparison, tests were conducted using 1.4, 1.5 and 1.6 with the same code/data, and Stage 5 csv write times were between 2 and 22 seconds.
> In addition, Parquet (Stage 3) write times on 1.4, 1.5, 1.6 and 2.0 were similar, between 2 and 22 seconds.
> Files 
> 1. Data File is "2008.csv"
> 2. Data file download http://stat-computing.org/dataexpo/2009/the-data.html
> 3. Code https://gist.github.com/bigsnarfdude/581b780ce85d7aaecbcb
> Observation 1 - Setup
> High CPU and 58 minute average completion time 
> * MACOSX 10.11.2
> * Macbook Pro 16g - 2.2 GHz Intel Core i7 -1TB 
> * spark-assembly-2.0.0
> * spark-csv_2.11-1.4
> * Code: https://gist.github.com/bigsnarfdude/581b780ce85d7aaecbcb
> Observation 2 - Setup
> High CPU; waited over an hour for the csv write but did not wait for it to complete
> * Ubuntu14.04
> * 4cores 8gb
> * spark-assembly-2.0.0
> * spark-csv_2.11-1.4
> Code Output: https://gist.github.com/bigsnarfdude/930f5832c231c3d39651
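
For scale, the Stage 5 elapsed time quoted in the description can be converted to minutes and compared against the 2 - 22 second csv write times reported for 1.4, 1.5 and 1.6. This is only a back-of-the-envelope sketch using the numbers from the report; the helper name `ns_to_seconds` and the variable names are made up for illustration and do not come from the linked gists.

```python
def ns_to_seconds(ns: int) -> float:
    """Convert a nanosecond count to seconds."""
    return ns / 1_000_000_000


# Stage 5 elapsed time as reported in the issue description (nanoseconds).
stage5_ns = 3488272270000

stage5_s = ns_to_seconds(stage5_ns)
stage5_min = stage5_s / 60

# Baseline csv write times reported for 1.4, 1.5 and 1.6 (seconds).
baseline_fast_s = 2
baseline_slow_s = 22

# Slowdown relative to the slowest and fastest reported baseline runs.
slowdown_vs_slowest = stage5_s / baseline_slow_s
slowdown_vs_fastest = stage5_s / baseline_fast_s

print(f"Stage 5: {stage5_s:.1f} s ({stage5_min:.1f} min)")
print(f"Slowdown vs. 22 s baseline: {slowdown_vs_slowest:.0f}x")
print(f"Slowdown vs. 2 s baseline: {slowdown_vs_fastest:.0f}x")
```

On these figures the 2.0.0 csv write is roughly two to three orders of magnitude slower than the earlier versions, which is consistent with the "58 minute average completion time" noted under Observation 1.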



