spark-user mailing list archives

From Adrian Tanase <>
Subject Re: How does shuffle work in spark ?
Date Mon, 19 Oct 2015 18:56:58 GMT
I don’t know why it expands to 50 GB, but it’s correct to see it both on the first operation
(shuffle write) and on the next one (shuffle read). It’s the barrier between the 2 stages.
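A minimal way to see why the two numbers match (a pure-Python sketch, not Spark code; the task names and byte counts are invented for illustration): the shuffle files written by one stage are exactly the bytes fetched by the next, so the UI reports the same total twice.

```python
# Hypothetical shuffle output: map task -> bytes written per reduce partition.
shuffle_files = {
    "map_task_0": {0: 1200, 1: 800},
    "map_task_1": {0: 500, 1: 1500},
}

# Stage 1 reports the total bytes written to local disk.
shuffle_write = sum(b for task in shuffle_files.values() for b in task.values())

# Stage 2 reports the total bytes each reduce partition fetches
# from every map task -- the same data, read back once.
shuffle_read = sum(task[p] for p in (0, 1) for task in shuffle_files.values())

assert shuffle_write == shuffle_read  # same bytes on both sides of the barrier
```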


From: shahid ashraf
Date: Monday, October 19, 2015 at 9:53 PM
To: Kartik Mathur, Adrian Tanase
Cc: user
Subject: Re: How does shuffle work in spark ?


I don't understand: if the original data on the partitions is 3.5 GB, how does shuffling it
expand it to 50 GB, and why does it then read 50 GB for the next operations? If I have an
original data set of 100 GB, my data would explode to about 1,428.6 GB,
and shuffle reads of 1,428.6 GB would be insane.

On Mon, Oct 19, 2015 at 11:58 PM, Kartik Mathur <> wrote:
That sounds like correct shuffle output. In Spark, the map and reduce phases are separated by
the shuffle: in the map phase each executor writes to its local disk, and in the reduce phase
the reducers read that data from each executor over the network, so shuffle definitely hurts
performance. For more details on Spark's shuffle phase please read this
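Kartik's description can be sketched as a toy model in plain Python (not Spark internals; the executor names and records are invented): each "executor" buckets its records by target reducer and writes them locally, then each reducer pulls its bucket from every executor.

```python
from collections import defaultdict

NUM_REDUCERS = 2

def hash_partition(key, n=NUM_REDUCERS):
    return hash(key) % n

# Map phase: each executor buckets its records by destination reducer
# and writes the buckets to its own "local disk" (modelled as a dict).
executors = {
    "exec_a": [("x", 1), ("y", 2)],
    "exec_b": [("x", 3), ("z", 4)],
}
local_disk = {}
for name, records in executors.items():
    buckets = defaultdict(list)
    for key, value in records:
        buckets[hash_partition(key)].append((key, value))
    local_disk[name] = buckets  # the "shuffle write"

# Reduce phase: each reducer fetches its bucket from every executor
# (over the network, in real Spark) -- the "shuffle read".
def reduce_fetch(reducer_id):
    fetched = []
    for buckets in local_disk.values():
        fetched.extend(buckets.get(reducer_id, []))
    return fetched

all_records = [r for i in range(NUM_REDUCERS) for r in reduce_fetch(i)]
# Nothing is lost or duplicated: reducers collectively read exactly
# what the mappers wrote.
assert sorted(all_records) == sorted(
    r for recs in executors.values() for r in recs
)
```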


On Mon, Oct 19, 2015 at 6:54 AM, shahid <> wrote:
@all I did partitionBy using the default hash partitioner on the data.
The total data was approx 3.5 GB, but it showed a shuffle write of 50 GB, and on the next
action, e.g. count, it is showing a shuffle read of 50 GB. I don't understand this
behaviour, and I think the performance is getting slow with so much shuffle
read on the next transformation operations.
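For reference, the default hash partitioner mentioned above boils down to something like the following (a simplified sketch only; Spark's actual implementation differs, e.g. it uses a portable hash in PySpark and a non-negative modulo on `hashCode` in Scala):

```python
def hash_partition(key, num_partitions):
    # Simplified hash partitioner: the same key always maps to the
    # same partition, so equal keys are colocated after the shuffle.
    return hash(key) % num_partitions

pairs = [("a", 1), ("b", 2), ("a", 3)]
placed = {}
for key, value in pairs:
    placed.setdefault(hash_partition(key, 4), []).append((key, value))

# Both records for key "a" land in the same partition.
part_of_a = hash_partition("a", 4)
assert [v for k, v in placed[part_of_a] if k == "a"] == [1, 3]
```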


with Regards
Shahid Ashraf