spark-user mailing list archives

From ayan guha <guha.a...@gmail.com>
Subject Re: Partitioning in spark
Date Fri, 24 Jun 2016 05:09:04 GMT
You can change parallelism like the following:

from pyspark import SparkConf, SparkContext

conf = SparkConf()
conf.set('spark.sql.shuffle.partitions', '10')
sc = SparkContext(conf=conf)
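
As a minimal sketch (assuming the 1.5.x SQLContext API and made-up column names), a joined dataframe picks up spark.sql.shuffle.partitions rather than the partitioning of its inputs:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, Row

conf = SparkConf().set('spark.sql.shuffle.partitions', '20')
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# Two toy dataframes; the join triggers a shuffle, so the result has
# spark.sql.shuffle.partitions partitions regardless of the inputs.
df1 = sqlContext.createDataFrame([Row(k=i, v1=i) for i in range(100)])
df2 = sqlContext.createDataFrame([Row(k=i, v2=i * 2) for i in range(100)])

joined = df1.join(df2, 'k')
print(joined.rdd.getNumPartitions())  # prints 20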



On Fri, Jun 24, 2016 at 6:46 AM, Darshan Singh <darshan.meel@gmail.com>
wrote:

> Hi,
>
> My default parallelism is 100. When I join 2 dataframes with 20 partitions
> each, the joined dataframe has 100 partitions. I want to know how to keep it
> at 20 (other than repartition and coalesce).
>
> Also, when I join these 2 dataframes I am using 4 columns as join columns.
> The dataframes are partitioned on the first 2 of these columns, so in effect
> each partition should only need to be joined with its corresponding
> partition and not with the rest, so why is Spark shuffling all the data?
>
> Similarly, when my dataframe is partitioned by col1,col2 and I group by
> col1,col2,col3,col4, why does it shuffle everything, when it only needs to
> sort each partition and do the grouping within it?
>
> This is a bit confusing; I am using 1.5.1.
>
> Is this fixed in later versions?
>
> Thanks
>



-- 
Best Regards,
Ayan Guha
