spark-issues mailing list archives

From "Matt Cheah (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-8597) DataFrame partitionBy memory pressure scales extremely poorly
Date Thu, 25 Jun 2015 18:33:06 GMT

    [ https://issues.apache.org/jira/browse/SPARK-8597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14601687#comment-14601687 ]

Matt Cheah commented on SPARK-8597:
-----------------------------------

I did some more digging. The memory is taken up by a hash map in DynamicPartitionWriterContainer
that maps each Parquet output path to the writer object that will write to that path, so one
writer stays open per distinct partition.
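For context, the pattern looks roughly like the sketch below (illustrative names only, not the
actual DynamicPartitionWriterContainer code): a writer is created lazily per distinct partition
path, and every writer stays open until the task commits, so memory grows linearly with the
number of (A, B) combinations.

{code}
import scala.collection.mutable

// Stand-in for the real Parquet OutputWriter; names here are illustrative.
trait PartitionWriter {
  def write(row: Seq[Any]): Unit
  def close(): Unit
}

class WriterContainer(newWriter: String => PartitionWriter) {
  // One entry per distinct partition path, e.g. "A=3/B=17".
  private val writers = mutable.HashMap.empty[String, PartitionWriter]

  def writeRow(partitionPath: String, row: Seq[Any]): Unit = {
    // getOrElseUpdate keeps every writer it ever creates alive until commit,
    // which is where the memory pressure comes from.
    val writer = writers.getOrElseUpdate(partitionPath, newWriter(partitionPath))
    writer.write(row)
  }

  def commitTask(): Unit = {
    writers.values.foreach(_.close())
    writers.clear()
  }
}
{code}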

YourKit profiling reveals that each output writer takes about 1 MB of space; I'm not sure
if I'm misreading the profile here, but that would explain the memory explosion.

I'm trying to figure out a way to reduce the memory footprint here and will write a PR accordingly.
[~marmbrus], do you have any thoughts on this?
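One direction I'm considering (purely a sketch of the idea, re-using the illustrative
PartitionWriter trait above; none of these names are the actual Spark internals) is to cap the
number of concurrently open writers and close the least-recently-used one when the cap is
exceeded:

{code}
import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}
import scala.collection.JavaConverters._

class BoundedWriterContainer(maxOpenWriters: Int,
                             newWriter: String => PartitionWriter) {

  // An access-ordered LinkedHashMap gives least-recently-used eviction.
  private val writers =
    new JLinkedHashMap[String, PartitionWriter](16, 0.75f, true) {
      override def removeEldestEntry(
          eldest: JMap.Entry[String, PartitionWriter]): Boolean = {
        val evict = size() > maxOpenWriters
        if (evict) {
          // Close (flush) the coldest writer to release its buffered state.
          // A real implementation would have to re-open it in append mode
          // if more rows for that partition show up later.
          eldest.getValue.close()
        }
        evict
      }
    }

  def writeRow(partitionPath: String, row: Seq[Any]): Unit = {
    var writer = writers.get(partitionPath)
    if (writer == null) {
      writer = newWriter(partitionPath)
      writers.put(partitionPath, writer)
    }
    writer.write(row)
  }

  def commitTask(): Unit = {
    writers.values().asScala.foreach(_.close())
    writers.clear()
  }
}
{code}

The trade-off is the cost of re-opening writers for partitions that come back later; another
option would be to sort the rows within each task by the partition columns so that only one
writer needs to be open at a time. I haven't measured either approach yet.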

> DataFrame partitionBy memory pressure scales extremely poorly
> -------------------------------------------------------------
>
>                 Key: SPARK-8597
>                 URL: https://issues.apache.org/jira/browse/SPARK-8597
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.0
>            Reporter: Matt Cheah
>            Priority: Critical
>         Attachments: table.csv
>
>
> I'm running into a strange memory scaling issue when using the partitionBy feature of
> DataFrameWriter.
> I've generated a table (a CSV file) with 3 columns (A, B and C) and 32*32 entries, about
> 20 KB on disk. There are 32 distinct values for column A and 32 distinct values for column
> B, and every combination of the two appears once (column C contains a random number for each
> row - it doesn't matter), producing a data set of 32*32 rows. I imported this into Spark and
> ran partitionBy("A", "B") to test its performance. This should create a nested directory
> structure with 32 folders, each containing another 32 folders. It uses about 10 GB of RAM
> and runs slowly. If I increase the number of entries in the table from 32*32 to 128*128, I
> get a Java heap space OutOfMemoryError no matter what value I use for the heap size setting.
> Scala code:
> {code}
> var df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("table.csv")
> df.write.partitionBy("A", "B").mode("overwrite").parquet("table.parquet")
> {code}
> How I ran the Spark shell:
> {code}
> bin/spark-shell --driver-memory 16g --master local[8] --packages com.databricks:spark-csv_2.10:1.0.3
> {code}





