spark-user mailing list archives

From Nicolae Marasoiu <>
Subject Re: Problem understanding spark word count execution
Date Thu, 01 Oct 2015 19:38:07 GMT

So you say " sc.textFile -> flatMap -> Map".

My understanding is like this:
First, a number of partitions is determined, p of them; you can give a hint for this.
Then partitions are assigned to the nodes that will load them, n nodes in total (where n <= p).
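The splitting step can be sketched roughly like this. This is a plain-Python illustration, not Spark's actual logic (in reality the Hadoop InputFormat computes the splits and respects block and record boundaries); the function name and sizes are made up for the example:

```python
# Rough sketch (not Spark code) of dividing a file's byte range
# [0, file_size) into p contiguous (start, length) partitions.
def split_into_partitions(file_size, p):
    base = file_size // p
    partitions = []
    start = 0
    for i in range(p):
        # spread the remainder over the first few partitions
        length = base + (1 if i < file_size % p else 0)
        partitions.append((start, length))
        start += length
    return partitions

parts = split_into_partitions(1000, 3)
print(parts)  # [(0, 334), (334, 333), (667, 333)]
```

The chunks cover the whole file with no gaps or overlaps, which is the invariant the real split computation also maintains.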

Roughly at the same time (or not), the n nodes start opening different sections of the file
- the physical equivalent of the partitions. In HDFS, for instance, I guess they would do an
open and a seek, read from the stream there, and convert the bytes to whatever the InputFormat dictates.

A shuffle can only be the part where a node opens an HDFS file, for instance, but does not
have a local replica of the blocks it needs to read (those pertaining to its assigned
partitions). In that case it needs to fetch them from remote nodes which do have replicas
of that data.
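To make the locality decision concrete, here is a toy sketch (hypothetical node and block names, not actual HDFS client code): a reader uses a local replica when it has one, otherwise it fetches from a node that does.

```python
# Which blocks live on which nodes (made-up placement for illustration).
replicas = {
    "block-1": {"node-a", "node-b"},
    "block-2": {"node-b", "node-c"},
}

def read_block(block, reader_node):
    hosts = replicas[block]
    if reader_node in hosts:
        # local replica: no network transfer needed
        return f"{reader_node} reads {block} locally"
    # no local replica: fetch from a remote host
    # (min() just makes the choice deterministic for the example)
    return f"{reader_node} fetches {block} from {min(hosts)}"

print(read_block("block-1", "node-a"))  # node-a reads block-1 locally
print(read_block("block-2", "node-a"))  # node-a fetches block-2 from node-b
```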

After the blocks are read into memory, flatMap and map are local computations generating new RDDs,
and in the end the result is sent to the driver (whatever the terminal action does on the RDD -
the result of reduce, the side effects of rdd.foreach, etc.).
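As a toy illustration of those local steps, here is a plain-Python simulation of the word-count pipeline (stand-ins for the RDD operations, not Spark API calls):

```python
from collections import Counter
from itertools import chain

lines = ["to be or not to be", "to do or not to do"]

# flatMap: each line -> many words (local computation, narrow dependency)
words = list(chain.from_iterable(line.split() for line in lines))

# map: each word -> (word, 1) pair (local computation, narrow dependency)
pairs = [(w, 1) for w in words]

# reduceByKey would shuffle the pairs so equal keys meet on one node,
# then sum the counts; a Counter gives the same end result locally.
counts = Counter(w for w, _ in pairs)

print(counts["to"])  # 4
print(counts["be"])  # 2
```

The flatMap and map stand-ins touch only local data; only the grouping-by-key step would require moving pairs between nodes in a real cluster.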

Maybe you can share more of your context if this is still unclear;
I just made assumptions to clarify a similar scenario.

From: Kartik Mathur <>
Sent: Thursday, October 1, 2015 10:25 PM
To: Nicolae Marasoiu
Cc: user
Subject: Re: Problem understanding spark word count execution

Thanks Nicolae ,
So in my case all executors are sending results back to the driver, and "shuffle is just
sending out the textFile to distribute the partitions" - could you please elaborate on this?
What exactly is in this file?

On Wed, Sep 30, 2015 at 9:57 PM, Nicolae Marasoiu <> wrote:


2- the end results are sent back to the driver; the shuffles are transmissions of intermediate
results between nodes, at the "->" steps between the intermediate transformations.

More precisely, since flatMap and map are narrow dependencies, meaning they can usually happen
on the local node, I bet shuffle is just sending out the textFile to a few nodes to distribute
the partitions.

From: Kartik Mathur <>
Sent: Thursday, October 1, 2015 12:42 AM
To: user
Subject: Problem understanding spark word count execution

Hi All,

I tried running the Spark word count example and I have a couple of questions -

I am analyzing stage 0, i.e.
 sc.textFile -> flatMap -> Map (word count example)

1) In the stage logs under the Application UI details, for every task I am seeing a shuffle
write of 2.7 KB. Question - how can I know where this task wrote to? E.g. how many bytes went
to which executor?

2) In the executor's log, the same task says 2000 bytes of result were sent to the driver.
My question is: if the results were sent directly to the driver, what is this shuffle
write?

