spark-dev mailing list archives

From Mark Hamstra <m...@clearstorydata.com>
Subject Re: Shuffle intermidiate results not being cached
Date Mon, 26 Dec 2016 14:07:59 GMT
Shuffle results are only reused if you are reusing the exact same RDD. If
you are working with DataFrames that you have not explicitly cached, they
will produce new RDDs each time their physical plan is created and
evaluated, so you won't get implicit shuffle reuse. This is what
https://issues.apache.org/jira/browse/SPARK-11838 is about.
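
To make the "exact same RDD" point concrete, here is a minimal toy model in plain Python. It is not Spark code: ToyRDD, ToyShuffleStore and plan_aggregation are hypothetical names invented for this sketch, and keying shuffle output by an RDD id is a simplification of what the scheduler really tracks (the shuffle dependency that belongs to a particular RDD instance). It only illustrates why logically identical work that is planned again does not find the earlier shuffle output.

```python
# Toy model only: none of these classes are Spark APIs.
import itertools


class ToyRDD:
    """Stand-in for an RDD: every instance gets its own unique id."""

    _ids = itertools.count()

    def __init__(self, name):
        self.id = next(ToyRDD._ids)
        self.name = name


class ToyShuffleStore:
    """Keeps shuffle output keyed by the id of the RDD that produced it."""

    def __init__(self):
        self.outputs = {}
        self.computed = 0  # how many times a shuffle actually ran

    def shuffle(self, rdd):
        if rdd.id not in self.outputs:
            # No output registered for this RDD instance: do the work.
            self.computed += 1
            self.outputs[rdd.id] = f"shuffle files for {rdd.name}#{rdd.id}"
        return self.outputs[rdd.id]


def plan_aggregation():
    """Mimics an uncached DataFrame: every execution plans a fresh RDD."""
    return ToyRDD("partial_agg")


store = ToyShuffleStore()

# The same RDD object used by two jobs: the second job reuses the output.
rdd = plan_aggregation()
store.shuffle(rdd)
store.shuffle(rdd)
reused_count = store.computed

# Logically identical work, planned again each time: new ids, no reuse.
store.shuffle(plan_aggregation())
store.shuffle(plan_aggregation())
replanned_count = store.computed - reused_count

print(reused_count, replanned_count)
```

In the model, the first pair of calls runs the shuffle once and the second pair runs it twice. In real code, the way to avoid the repeated work is the explicit caching mentioned above, i.e. calling cache() or persist() on the DataFrame whose partial result you want to keep.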

On Mon, Dec 26, 2016 at 5:56 AM, assaf.mendelson <assaf.mendelson@rsa.com>
wrote:

> Hi,
>
> Sorry to be bothering everyone on the holidays but I have found what may
> be a bug.
>
> I am doing “manual” streaming (see
> http://stackoverflow.com/questions/41266956/apache-spark-streaming-performance
> for the specific code), where I essentially read an additional dataframe
> from file each time, union it with the previous dataframes to create a
> “window”, and then do a double aggregation on the result.
>
> Having looked at the documentation
> (https://spark.apache.org/docs/latest/programming-guide.html#which-storage-level-to-choose
> right above the headline), I expected Spark to automatically cache the
> partial aggregation for each dataframe read and then continue with the
> aggregations from there. Instead, it seems to read each dataframe from file
> all over again.
>
> Is this a bug? Am I doing something wrong?
>
> Thanks.
>
>                 Assaf.
>
