spark-issues mailing list archives

From "Fernando Pereira (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-17998) Reading Parquet files coalesces parts into too few in-memory partitions
Date Wed, 10 Jan 2018 22:19:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-17998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16321234#comment-16321234 ]

Fernando Pereira commented on SPARK-17998:
------------------------------------------

It says spark.sql.files.maxPartitionBytes in this very JIRA ticket. Try it yourself: spark.files.maxPartitionBytes
doesn't work.
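
For reference, a minimal sketch of setting that option in PySpark before the read; the session name reuses the reproduction below, and the 16 MB value is illustrative, not a recommendation:

{quote}
# Sketch only: lower the per-partition byte target so the Parquet read
# is split into more in-memory partitions (the default is 128 MB).
session.conf.set("spark.sql.files.maxPartitionBytes", 16 * 1024 * 1024)

df = session.read.parquet(r"c:\scratch\scrap.parquet")
print(df.rdd.getNumPartitions())  # expect more partitions than with the default
{quote}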

> Reading Parquet files coalesces parts into too few in-memory partitions
> -----------------------------------------------------------------------
>
>                 Key: SPARK-17998
>                 URL: https://issues.apache.org/jira/browse/SPARK-17998
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 2.0.0, 2.0.1
>         Environment: Spark Standalone Cluster (not "local mode")
> Windows 10 and Windows 7
> Python 3.x
>            Reporter: Shea Parkes
>
> Reading a parquet file into a DataFrame is resulting in far too few in-memory partitions.
> In prior versions of Spark, the resulting DataFrame would have a number of partitions often
> equal to the number of parts in the parquet folder.
> Here's a minimal reproducible sample:
> {quote}
> df_first = session.range(start=1, end=100000000, numPartitions=13)
> assert df_first.rdd.getNumPartitions() == 13
> assert session._sc.defaultParallelism == 6
> path_scrap = r"c:\scratch\scrap.parquet"
> df_first.write.parquet(path_scrap)
> df_second = session.read.parquet(path_scrap)
> print(df_second.rdd.getNumPartitions())
> {quote}
> The above shows only 7 partitions in the DataFrame that was created by reading the Parquet
> back into memory for me.  Why is it no longer just the number of part files in the Parquet
> folder?  (Which is 13 in the example above.)
> I'm filing this as a bug because it has gotten so bad that we can't work with the underlying
> RDD without first repartitioning the DataFrame, which is costly and wasteful.  I really doubt
> this was the intended effect of moving to Spark 2.0.
> I've tried to research where the number of in-memory partitions is determined, but my
> Scala skills have proven inadequate.  I'd be happy to dig further if someone could point
> me in the right direction...
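
For context, a minimal sketch of the repartitioning workaround described above (illustrative only; repartition() shuffles the data, which is exactly the cost being objected to):

{quote}
df_second = session.read.parquet(path_scrap)
# Force the partition count back up to match the 13 part files; this triggers a shuffle.
df_second = df_second.repartition(13)
assert df_second.rdd.getNumPartitions() == 13
{quote}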


