spark-user mailing list archives

From Mohammed Guller <>
Subject RE: dataframes and numPartitions
Date Thu, 15 Oct 2015 23:52:01 GMT
You may find the spark.sql.shuffle.partitions property useful. It controls the number of partitions used when shuffling data for joins and aggregations; the default value is 200.
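A minimal sketch of setting that property, assuming a Spark 1.x SQLContext is already in scope (the value 50 is an arbitrary illustration, not a recommendation):

```scala
// Override the default shuffle partition count (200) for
// DataFrame joins and aggregations in this SQLContext.
sqlContext.setConf("spark.sql.shuffle.partitions", "50")

// The same setting can also be passed at submit time:
//   spark-submit --conf spark.sql.shuffle.partitions=50 ...
```

Note this is a session-wide setting, not a per-operation parameter like the RDD API's numPartitions argument.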


From: Alex Nastetsky []
Sent: Wednesday, October 14, 2015 8:14 PM
To: user
Subject: dataframes and numPartitions

A lot of RDD methods take a numPartitions parameter that lets you specify the number of partitions
in the result. For example, groupByKey.

The DataFrame counterparts don't take a numPartitions parameter, e.g. groupBy accepts only a
list of Columns as params.

I understand that the DataFrame API is supposed to be smarter and go through a LogicalPlan,
and perhaps determine the optimal number of partitions for you, but sometimes you want to
specify the number of partitions yourself. One such use case is when you are preparing to
do a "merge" join with another dataset that is similarly partitioned with the same number
of partitions.
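One way to sketch that use case with the DataFrame API, assuming hypothetical DataFrames dfA and dfB joined on a column named "id": DataFrame.repartition(numPartitions) forces a full shuffle to the requested partition count (a column-aware repartition(numPartitions, cols*) overload was added later, in Spark 1.6):

```scala
// Force both sides of the join to the same partition count
// before joining, so the planner works with equally
// partitioned inputs. dfA, dfB, and "id" are placeholders.
val n = 64
val left  = dfA.repartition(n)
val right = dfB.repartition(n)
val joined = left.join(right, left("id") === right("id"))
```

Whether the optimizer actually performs a merge-style join still depends on the physical plan it chooses, so this controls the partition count rather than guaranteeing a particular join strategy.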