Mailing-List: contact issues-help@spark.apache.org; run by ezmlm
Precedence: bulk
Date: Sun, 20 Mar 2016 15:53:33 +0000 (UTC)
From: "Sean Owen (JIRA)" <jira@apache.org>
To: issues@spark.apache.org
Message-ID: <JIRA.12951795.1458488220000.70502.1458489213438@Atlassian.JIRA>
In-Reply-To: <JIRA.12951795.1458488220000@Atlassian.JIRA>
References: <JIRA.12951795.1458488220000@Atlassian.JIRA>
 <JIRA.12951795.1458488220126@arcas>
Subject: [jira] [Commented] (SPARK-14031) Dataframe to csv IO, system
 performance enters high CPU state and write operation takes 1 hour to
 complete
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/SPARK-14031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203335#comment-15203335 ] 

Sean Owen commented on SPARK-14031:
-----------------------------------

Which stage is slow? A lot of this is irrelevant. You don't need to link screenshots or binaries; you should inline a minimal reproduction instead.
It's not clear that this is slow relative to the task at hand. Are you seeing GC thrashing? What are your settings for spark-shell? And so on.
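
As a rough illustration of the kind of per-stage timing that would help here, the sketch below times two steps separately and converts the nanosecond figure quoted in the description into seconds and minutes. It uses only the Python standard library; the helper names (format_ns, timed) and the two placeholder steps are hypothetical stand-ins, not the reporter's code or any Spark API.

```python
import time
from contextlib import contextmanager


def format_ns(elapsed_ns: int) -> str:
    """Render a nanosecond duration as seconds and minutes."""
    seconds = elapsed_ns / 1_000_000_000
    return f"{seconds:.1f} s ({seconds / 60:.1f} min)"


@contextmanager
def timed(label: str, results: dict):
    """Store the wall-clock duration (in ns) of the enclosed block under label."""
    start = time.perf_counter_ns()
    try:
        yield
    finally:
        results[label] = time.perf_counter_ns() - start


# The average quoted in the issue description, in nanoseconds.
reported = format_ns(3488272270000)

# Hypothetical placeholders for the real read and write steps.
timings = {}
with timed("read", timings):
    rows = [(i, i * 2) for i in range(1000)]
with timed("write", timings):
    lines = [",".join(map(str, row)) for row in rows]

print("reported:", reported)
for label, elapsed in timings.items():
    print(label, format_ns(elapsed))
```

Timing the read and the write separately, rather than the whole session, would show whether the time is spent in the CSV write itself or earlier in the job.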

> Dataframe to csv IO, system performance enters high CPU state and write operation takes 1 hour to complete
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-14031
>                 URL: https://issues.apache.org/jira/browse/SPARK-14031
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Shell
>    Affects Versions: 2.0.0
>         Environment: MACOSX 10.11.2 Macbook Pro 16g - 2.2 GHz Intel Core i7 -1TB and Ubuntu14.04 Vagrant 4 Cores 8g
>            Reporter: Vincent Ohprecio
>            Priority: Critical
>         Attachments: screenshot-1.png, screenshot-2.png
>
>
> Summary
> When writing the results of a DataFrame out to CSV from spark-shell, the system enters a high-CPU state and the write operation takes about 1 hour to complete. To recreate the high CPU usage and the CSV write averaging 3488272270000 ns (roughly 1 hour):
> 1. Data File is "2008.csv"
> 2. Data file download http://stat-computing.org/dataexpo/2009/the-data.html
> 3. Code https://gist.github.com/bigsnarfdude/581b780ce85d7aaecbcb
> High CPU and 58-minute average completion time on Mac OS X 10.11.2
> (MacBook Pro 16 GB, 2.2 GHz Intel Core i7, 1 TB)
> * Screenshot http://imgur.com/a0zYgvj
> 1.  spark-assembly-2.0.0-SNAPSHOT-hadoop2.4.0.jar spark-csv_2.11-1.3 
> https://gist.github.com/bigsnarfdude/403e18600d42fc24cf58
> 2.  spark-assembly-2.0.0-SNAPSHOT-hadoop2.4.0.jar spark-csv_2.11-1.2
> https://gist.github.com/bigsnarfdude/5935fcbb80233cb83cc6
> 3.  spark-assembly-2.0.0-SNAPSHOT-hadoop2.4.0.jar spark-csv_2.11-1.4
> https://gist.github.com/bigsnarfdude/581b780ce85d7aaecbcb
> High CPU on Ubuntu 14.04: waited over an hour for the CSV write, but did not wait for it to complete
> * Screenshot http://imgur.com/WCmQkKj
> 1.  spark-assembly-2.0.0-SNAPSHOT-hadoop2.4.0.jar spark-csv_2.11-1.4
> https://gist.github.com/bigsnarfdude/930f5832c231c3d39651
> 2.  spark-assembly-2.0.0-SNAPSHOT-hadoop2.4.0.jar spark-csv_2.11-1.3  
> https://gist.github.com/bigsnarfdude/6d3a0b6733cc57dd22ac  
> Tested working in 5-6 seconds on Mac OS X 10.11.2
> 1.  spark-assembly-1.4.0-hadoop2.4.0.jar spark-csv_2.10-1.4.0 java1.7
> https://gist.github.com/bigsnarfdude/c540129813f3a0d7af2f
> 2.  spark-assembly-1.6.2-SNAPSHOT-hadoop2.4.0.jar spark-csv_2.10-1.4.0 java1.7 
> https://gist.github.com/bigsnarfdude/0851fcecede9403b78fe
> Tested working by members of the user mailing list
> 1.  Spark version 1.5.2 spark-csv_2.11:1.3.0 (Mich Talebzadeh)
> 2.  Spark 1.5.2 Scala 2.10 Spark-csv 1.4.0 Java 1.8 (Marco Mistroni)
> Tested working in 20-22 seconds on Ubuntu 14.04
> 1.  spark-assembly-1.6.2-SNAPSHOT-hadoop2.4.0.jar spark-csv_2.10-1.4.0
> https://gist.github.com/bigsnarfdude/08b08f68aef4a4309bc0


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
