spark-user mailing list archives

From Alex Nastetsky <>
Subject dataframes and numPartitions
Date Thu, 15 Oct 2015 03:14:03 GMT
A lot of RDD methods take a numPartitions parameter that lets you specify
the number of partitions in the result. For example, groupByKey.

The DataFrame counterparts don't have a numPartitions parameter, e.g.
groupBy only takes a bunch of Columns as params.

I understand that the DataFrame API is supposed to be smarter: it goes
through a LogicalPlan and can perhaps determine the optimal number of
partitions for you. But sometimes you want to specify the number of
partitions yourself. One such use case is when you are preparing to do a
"merge" join with another dataset that is already partitioned the same
way, with the same number of partitions.
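To make the contrast concrete, here is a minimal sketch (assuming a
Spark 1.5-era `SparkContext` named `sc` and `SQLContext` named
`sqlContext`; the names `rdd`, `df`, `k`, and `v` are illustrative):

```scala
import org.apache.spark.sql.functions._

// RDD API: the partition count of the result can be passed directly.
val rdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val grouped = rdd.groupByKey(numPartitions = 8)

// DataFrame API: groupBy only takes Columns, so the partition count
// has to be influenced some other way, e.g. by repartitioning first.
val df = sqlContext.createDataFrame(Seq(("a", 1), ("b", 2))).toDF("k", "v")
val agg = df.repartition(8).groupBy(col("k")).agg(sum(col("v")))
```

Note that `repartition(n)` here distributes rows round-robin rather than
hashing by the grouping key, so it controls only the count, not the
placement. The number of post-shuffle partitions for DataFrame
aggregations and joins is instead governed by the
`spark.sql.shuffle.partitions` setting (default 200).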
