spark-issues mailing list archives

From "sam (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-24425) Regression from 1.6 to 2.x - Spark no longer respects input partitions, unnecessary shuffle required
Date Wed, 30 May 2018 08:03:00 GMT
sam created SPARK-24425:
---------------------------

             Summary: Regression from 1.6 to 2.x - Spark no longer respects input partitions, unnecessary shuffle required
                 Key: SPARK-24425
                 URL: https://issues.apache.org/jira/browse/SPARK-24425
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.0.2
            Reporter: sam


I think this is a regression. In 1.6 we could easily control the number of output files / tasks via the number of input files plus `coalesce`. In 2.x I have to use `repartition` to get the desired number of files / partitions, which forces an unnecessary full shuffle.
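For illustration, a minimal sketch of the pattern that used to work (paths and counts are hypothetical):

{code:java}
// Spark 1.6: reading N parquet files gave N input partitions, so coalesce
// alone controlled the output file count with no shuffle:
//   sqlContext.read.parquet("/data/in").coalesce(10).write.parquet("/data/out")

// Spark 2.x: input files are packed into fewer partitions on read, so an
// exact file count now needs a full shuffle via repartition:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("example").master("local[*]").getOrCreate()
spark.read.parquet("/data/in").repartition(10).write.parquet("/data/out")
{code}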

I've tried playing with `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes` to see if I can force the 1.6 behaviour:
{code:java}
import org.apache.spark.sql.SparkSession

// Attempt to force one partition per file via the file-packing settings.
val ss = SparkSession.builder().appName("uber-cp").master(conf.master())
  .config("spark.sql.files.maxPartitionBytes", 1)
  .config("spark.sql.files.openCostInBytes", Long.MaxValue)
  .getOrCreate()
{code}
This didn't work: Spark still packs all my parquet files into fewer partitions.
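For reference, this seems to come down to the split-size calculation in 2.x's `FileSourceScanExec`; the following is a paraphrase of my reading of the 2.x source, with illustrative values, and details may differ by minor version:

{code:java}
// Each file's size is padded by openCostInBytes, then files are binned into
// partitions of at most maxSplitBytes, which is why many small files end up
// squashed together. Values below are illustrative (the first two are the
// Spark defaults for the corresponding settings).
val defaultMaxSplitBytes = 128L * 1024 * 1024  // spark.sql.files.maxPartitionBytes
val openCostInBytes = 4L * 1024 * 1024         // spark.sql.files.openCostInBytes
val defaultParallelism = 8                     // sc.defaultParallelism
val totalBytes = 1024L * 1024 * 1024           // sum of file sizes + per-file padding
val bytesPerCore = totalBytes / defaultParallelism
val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
{code}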

I suggest a simple `option` on `DataFrameReader` that can disable this packing (or enable it; the default behaviour should be the same as 1.6), as sketched below.
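A sketch of what I have in mind (the option name is purely illustrative, not an existing API):

{code:java}
// Hypothetical: ask the reader to keep one partition per input file, as in
// 1.6, instead of packing files together.
val df = ss.read
  .option("respectInputPartitions", "true")   // hypothetical option name
  .parquet("/data/in")                        // hypothetical path
{code}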

 

This relates to https://issues.apache.org/jira/browse/SPARK-5997, in that if SPARK-5997 were implemented, this ticket wouldn't really be necessary.
