spark-user mailing list archives

From Gavin Yue <yue.yuany...@gmail.com>
Subject Cache after filter Vs Writing back to HDFS
Date Thu, 17 Sep 2015 21:17:41 GMT
For a large dataset, I want to filter out some records and then run
computation-intensive work on the rest.

What I am doing now:

val filtered = Data.filter(somerules).cache()
filtered.count()

filtered.map(timeintensivecompute)

But this sometimes takes an unusually long time, because cached partitions
go missing and have to be recomputed.
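For reference, here is what I believe the cached version should look like (a minimal sketch; `Data`, `somerules`, and `timeintensivecompute` stand in for my actual RDD and functions). As I understand it, `cache()` defaults to `MEMORY_ONLY`, so partitions evicted under memory pressure are recomputed from the source; `MEMORY_AND_DISK` spills them to local disk instead:

```scala
import org.apache.spark.storage.StorageLevel

// Spill evicted partitions to local disk rather than recomputing them
// from the original source on a cache miss.
val filtered = Data.filter(somerules)
  .persist(StorageLevel.MEMORY_AND_DISK)

filtered.count()  // action to materialize the cache

val result = filtered.map(timeintensivecompute)
```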

So I switched to this approach:

Data.filter(somerules).saveAsTextFile(path)

sc.textFile(path).map(timeintensivecompute)

The second approach is even faster.

How could I tune the job to reach maximum performance?
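One middle ground I am considering is RDD checkpointing, which also writes the filtered data to HDFS but keeps the serialized objects instead of round-tripping through text. This is a sketch under my assumptions; the checkpoint directory path is a placeholder:

```scala
import org.apache.spark.storage.StorageLevel

// Placeholder HDFS directory for checkpoint files.
sc.setCheckpointDir("hdfs:///tmp/checkpoints")

val filtered = Data.filter(somerules)
  .persist(StorageLevel.MEMORY_AND_DISK)

// Truncates the lineage: the filtered RDD is written to the checkpoint
// directory on the next action, so lost partitions are reread from HDFS
// instead of being recomputed through the filter.
filtered.checkpoint()
filtered.count()

val result = filtered.map(timeintensivecompute)
```

Persisting before checkpointing should avoid computing the filter twice (once for the action, once for writing the checkpoint).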

Thank you.
