spark-issues mailing list archives

From "Sean Owen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-22584) dataframe write partitionBy out of disk/java heap issues
Date Thu, 23 Nov 2017 15:11:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-22584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16264448#comment-16264448 ]

Sean Owen commented on SPARK-22584:
-----------------------------------

It depends on too many things: what did you transform the data into? Did you cache it? How
much memory is actually allocated to Spark, driver vs. executor? What ran out of memory, and where?
This is too open-ended for a JIRA.
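
On the question of how much memory is actually allocated: one way to confirm the driver and
executor settings the running application actually received is to read them back from the
SparkConf. This is only an illustrative sketch, not part of the report; `spark.driver.memory`
and `spark.executor.memory` are the standard Spark settings involved.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: read back the memory settings the running application actually
// received, rather than assuming submit-time flags or EMR's
// maximizeResourceAllocation took effect as expected.
val spark = SparkSession.builder().getOrCreate()
val conf = spark.sparkContext.getConf
println("spark.driver.memory   = " + conf.getOption("spark.driver.memory").getOrElse("<default: 1g>"))
println("spark.executor.memory = " + conf.getOption("spark.executor.memory").getOrElse("<default: 1g>"))
```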

> dataframe write partitionBy out of disk/java heap issues
> --------------------------------------------------------
>
>                 Key: SPARK-22584
>                 URL: https://issues.apache.org/jira/browse/SPARK-22584
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Derek M Miller
>
> I have been seeing some issues with partitionBy for the dataframe writer. I currently
> have a file that is 6 MB, just for testing, and it has around 1487 rows and 21 columns. There
> is nothing out of the ordinary with the columns, which are either DoubleType or StringType.
> The partitionBy call uses two partition columns, both with verified low cardinality: one
> column has 30 unique values and the other has 2.
> ```scala
> df
>   .write.partitionBy("first", "second")
>   .mode(SaveMode.Overwrite)
>   .parquet(s"$location$example/$corrId/")
> ```
> When running this example on Amazon's EMR with 5 r4.xlarges (30 GB of memory each), I
> am getting a Java heap out-of-memory error. I have maximizeResourceAllocation set, and verified
> it on the instances. I have even set it to false and explicitly set the driver and executor
> memory to 16g, but still had the same issue. Occasionally I get an error about disk space
> instead, and the job seems to work if I use an r3.xlarge (which has an SSD). But it seems
> odd that 6 MB of data would need to spill to disk.
> The problem mainly seems to be centered around using two partition columns vs. one. If I
> partition by either column alone, I have no problems. It's also worth noting that each of
> the partitions is evenly distributed.
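
Not part of the original report, but for readers landing here: a frequently suggested workaround
for memory and file-count problems with dynamic-partition writes is to pre-shuffle the data by
the same columns passed to partitionBy, so that all rows for a given output partition land in the
same task. A minimal sketch, reusing `df` and the path variables (`location`, `example`, `corrId`)
from the quoted snippet; whether it helps depends on the actual cause of the failure.

```scala
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

// Sketch (assumes the same df/location/example/corrId as the quoted snippet):
// shuffle rows so each task only holds data for a small set of output
// partition values before the partitioned Parquet write.
df.repartition(col("first"), col("second"))
  .write
  .partitionBy("first", "second")
  .mode(SaveMode.Overwrite)
  .parquet(s"$location$example/$corrId/")
```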



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


