Mailing-List: contact user-help@spark.apache.org; run by ezmlm
Precedence: bulk
MIME-Version: 1.0
In-Reply-To: <CAFPEfUvP+0LAvFMd-5oSL5eAiarBquLdpb5=sLSq_4=gqOj1cA@mail.gmail.com>
References: <CAFPEfUvP+0LAvFMd-5oSL5eAiarBquLdpb5=sLSq_4=gqOj1cA@mail.gmail.com>
From: Davies Liu <davies@databricks.com>
Date: Fri, 2 Sep 2016 13:21:51 -0700
Message-ID: <CA+2Pv=i+j_ppmOowDNu3nVyTPp3RPF+bcQDEhm04fSUS_MKrTA@mail.gmail.com>
Subject: Re: Is cache() still necessary for Spark DataFrames?
To: apu <apumishra.rr@gmail.com>
Cc: user <user@spark.apache.org>
Content-Type: text/plain; charset=UTF-8
archived-at: Fri, 02 Sep 2016 20:21:59 -0000

Caching an RDD/DataFrame always has some cost. In this case, I'd suggest not
caching the DataFrame: first() is usually fast enough, since it only computes
the partitions it needs.
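
To make the "only compute the partitions as needed" point concrete, here is a
toy model in plain Python. It is not Spark code: the names make_partitions,
count and first are made up for illustration, and it assumes the transformation
works partition by partition with no shuffle. It only sketches why a count()
has to touch every partition while a first() can stop early.

```python
# Toy model only: plain Python standing in for lazy, per-partition evaluation.
# None of these names are Spark APIs; they are hypothetical helpers.

computed = []  # records which partition ids were actually computed


def make_partitions(num_partitions, rows_per_partition):
    """Return one zero-argument function per partition.

    Calling a function "computes" that partition, i.e. applies the
    (pretend) transformation x * 2 to the partition's rows.
    """
    def make(pid):
        def compute():
            computed.append(pid)
            start = pid * rows_per_partition
            return [x * 2 for x in range(start, start + rows_per_partition)]
        return compute
    return [make(pid) for pid in range(num_partitions)]


def count(partitions):
    """Count-like action: every partition has to be computed."""
    return sum(len(compute()) for compute in partitions)


def first(partitions):
    """First-like action: stop at the first partition that yields a row."""
    for compute in partitions:
        rows = compute()
        if rows:
            return rows[0]
    raise ValueError("empty dataset")


parts = make_partitions(num_partitions=8, rows_per_partition=1000)

a = count(parts)
partitions_for_count = len(computed)

computed.clear()
b = first(parts)
partitions_for_first = len(computed)

print(a, partitions_for_count)  # 8000 8
print(b, partitions_for_first)  # 0 1
```

In this model the second action recomputes one partition out of eight, whereas
caching would mean keeping all eight around just to save that one. If the
transformation involved a shuffle, the upstream work would be larger, so this
sketch should not be read as a general rule.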

On Fri, Sep 2, 2016 at 1:05 PM, apu <apumishra.rr@gmail.com> wrote:
> When I first learnt Spark, I was told that cache() is desirable anytime one
> performs more than one Action on an RDD or DataFrame. For example, consider
> the PySpark toy example below; it shows two approaches to doing the same
> thing.
>
> # Approach 1 (bad?)
> df2 = someTransformation(df1)
> a = df2.count()
> b = df2.first() # This step could take long, because df2 has to be created
> # all over again
>
> # Approach 2 (good?)
> df2 = someTransformation(df1)
> df2.cache()
> a = df2.count()
> b = df2.first() # Because df2 is already cached, this action is quick
> df2.unpersist()
>
> The second approach shown above is somewhat clunky, because it requires one
> to cache any dataframe that will be Acted on more than once, followed by the
> need to call unpersist() later to free up memory.
>
> So my question is: is the second approach still necessary/desirable when
> operating on DataFrames in newer versions of Spark (>=1.6)?
>
> Thanks!!
>
> Apu
