spark-user mailing list archives

From jatinganhotra <>
Subject Checkpointing calls the job twice?
Date Sun, 18 Oct 2015 03:38:37 GMT

I noticed that checkpointing an RDD appears to run the action twice: the
Spark UI shows 2 jobs being executed.

val logFile = "/data/pagecounts"
val logData = sc.textFile(logFile, 2)
val as = logData.filter(line => line.contains("a"))

Scenario #1:
as.count()        // Only 1 job.

But if I change the above code to the following:

Scenario #2:

Here, 2 jobs are executed, as shown in the Spark UI, with durations of
0.9s and 0.4s.

Why are there 2 jobs in scenario #2? In the Spark source code, the comment
for RDD.checkpoint() says the following:
"This function must be called before any job has been executed on this RDD.
It is strongly recommended that this RDD is persisted in memory, otherwise
saving it on a file will require recomputation."

In my example above, I am calling cache() before checkpoint(), so the RDD
will be persisted in memory. Also, both of these calls come before the
count() action, so checkpoint() is called before any job has been executed.
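
For reference, a minimal sketch of the call order described above (cache(),
then checkpoint(), then count()). The CheckpointJobs object, the run helper,
the temporary checkpoint directory, and the local[2] master are assumptions
added for illustration and are not part of my original code; only the
textFile/filter lines come from the snippet at the top.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical wrapper object, used only to make the sketch self-contained.
object CheckpointJobs {

  // Returns the count and whether the RDD reports itself as checkpointed.
  def run(sc: SparkContext, logFile: String): (Long, Boolean) = {
    // Assumption: checkpoint() needs a checkpoint directory to be set first;
    // a temporary local directory is used here purely for illustration.
    sc.setCheckpointDir(Files.createTempDirectory("spark-checkpoint").toString)

    val logData = sc.textFile(logFile, 2)
    val as = logData.filter(line => line.contains("a"))

    as.cache()       // lazily marks the RDD for in-memory persistence
    as.checkpoint()  // lazily marks the RDD for checkpointing
    val n = as.count() // first action on the RDD; this is where jobs start

    (n, as.isCheckpointed)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode with 2 threads, chosen only for the sketch.
    val conf = new SparkConf().setAppName("checkpoint-jobs").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (n, checkpointed) = run(sc, "/data/pagecounts")
      println(s"count=$n, isCheckpointed=$checkpointed")
    } finally {
      sc.stop()
    }
  }
}
```

Running this and watching the Jobs tab in the Spark UI should be enough to
compare against Scenario #1, where cache() and checkpoint() are left out.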
