spark-user mailing list archives

From shahid ashraf <sha...@trialx.com>
Subject SHUFFLE in PARTITIONBY or shuffle in general
Date Mon, 19 Oct 2015 11:16:38 GMT
Hi Folks,

I am not able to understand the shuffle in partitionBy.

I am doing partitionBy (hash partitioning on an int key) to repartition the data
because of data skew. See the screenshot below: after the partitionBy
(repartitioning), why is the shuffle write so high, about 50 GB for only 3 GB of
data, and why is the shuffle read so high as well? After that, for the collect()
in the task that gets the total count of records in each partition, the shuffle
read is also 50 GB. Also, the count before partitioning took only 19 s; see
stage 5 vs stage 10 in the screenshot.
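
For context, this is roughly the pattern I am running from spark-shell (the data,
names, and partition count below are just placeholders to illustrate, not my
actual job):

import org.apache.spark.HashPartitioner

// assumes the SparkContext `sc` that spark-shell provides
// placeholder pair RDD with an Int key, standing in for my skewed ~3 GB dataset
val pairs = sc.parallelize(1 to 1000000).map(i => (i % 100, s"record-$i"))

// partitionBy does a full shuffle: each record is hashed on its key,
// written out by the map-side tasks (shuffle write) and then fetched
// by the tasks that own the target partitions (shuffle read)
val repartitioned = pairs.partitionBy(new HashPartitioner(200))

// count records per partition; collect() only brings the small
// (partitionId, count) pairs back to the driver, not the data itself
val countsPerPartition = repartitioned
  .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
  .collect()

countsPerPartition.foreach { case (idx, n) => println(s"partition $idx: $n records") }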

What are shuffle read and shuffle write in the partitionBy task?


-- 
with Regards
Shahid Ashraf
