spark-user mailing list archives

From Davies Liu <dav...@databricks.com>
Subject Re: Is cache() still necessary for Spark DataFrames?
Date Fri, 02 Sep 2016 20:21:51 GMT
Caching an RDD/DataFrame always has some cost. In this case, I'd suggest
not caching the DataFrame; first() is usually fast enough, since it only
computes the partitions it needs.
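
In other words, a rough sketch of the uncached version (reusing
someTransformation/df1 from your example below; only the comments are new):

df2 = someTransformation(df1)  # lazy: nothing is computed yet
a = df2.count()                # triggers a full scan of df2
b = df2.first()                # recomputes df2, but usually only its first
                               # partition, so this is typically cheap

If df2 is genuinely expensive and reused many times, cache() can still pay
off, but for count() followed by first() the recomputation is usually not
worth the caching overhead.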

On Fri, Sep 2, 2016 at 1:05 PM, apu <apumishra.rr@gmail.com> wrote:
> When I first learnt Spark, I was told that cache() is desirable anytime one
> performs more than one Action on an RDD or DataFrame. For example, consider
> the PySpark toy example below; it shows two approaches to doing the same
> thing.
>
> # Approach 1 (bad?)
> df2 = someTransformation(df1)
> a = df2.count()
> b = df2.first()  # This step could take long, because df2 has to be created all over again
>
> # Approach 2 (good?)
> df2 = someTransformation(df1)
> df2.cache()
> a = df2.count()
> b = df2.first() # Because df2 is already cached, this action is quick
> df2.unpersist()
>
> The second approach shown above is somewhat clunky, because it requires one
> to cache any DataFrame that will be acted on more than once, and then to call
> unpersist() later to free up memory.
>
> So my question is: is the second approach still necessary/desirable when
> operating on DataFrames in newer versions of Spark (>=1.6)?
>
> Thanks!!
>
> Apu


