spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Cesar Flores <>
Subject Question about data frame partitioning in Spark 1.3.0
Date Wed, 14 Oct 2015 15:45:11 GMT
My current version of spark is 1.3.0 and my question is the next:

I have large data frames where the main field is an user id. I need to do
many group by's and joins using that field. Do the performance will
increase if before doing any group by or join operation I first convert to
rdd to partition by the user id? In other words trying something like the
next lines in all my user data tables will improve the performance in the
long run?:

val partitioned_rdd = unpartitioned_df
   .map(row=>(row.getLong(0), row))
   .partitionBy(new HashPartitioner(200))
   .map(x => x._2)

val partitioned_df = hc.createDataFrame(partitioned_rdd,

Thanks a lot
Cesar Flores

View raw message