spark-reviews mailing list archives

From szhem <>
Subject [GitHub] spark pull request #19373: [SPARK-22150][CORE] PeriodicCheckpointer fails in...
Date Wed, 27 Sep 2017 21:54:59 GMT
GitHub user szhem opened a pull request:

    [SPARK-22150][CORE] PeriodicCheckpointer fails in case of dependant RDDs

    ## What changes were proposed in this pull request?
    Fix for the SPARK-22150 JIRA issue.
    When checkpointing RDDs that depend on previously checkpointed RDDs (for example,
in iterative algorithms), PeriodicCheckpointer removes the checkpoint files of already
materialized RDDs too early, which leads to FileNotFoundExceptions.
    Consider the following snippet:
        // create a periodic checkpointer with interval of 2
        val checkpointer = new PeriodicRDDCheckpointer[Double](2, sc)
        val rdd1 = createRDD(sc)
        checkpointer.update(rdd1)
        // on the second update rdd1 is checkpointed
        checkpointer.update(rdd1)
        // on action checkpointed rdd is materialized and its lineage is truncated
        rdd1.count()
        // rdd2 depends on rdd1
        val rdd2 = rdd1.filter(_ => true)
        checkpointer.update(rdd2)
        // on the second update rdd2 is checkpointed and checkpoint files of rdd1 are deleted
        checkpointer.update(rdd2)
        // on action it's necessary to read already removed checkpoint files of rdd1
        rdd2.count()
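The ordering problem described above can be reproduced without Spark. The following is a minimal toy model, not the code of this patch and not Spark's implementation: `Node`, `ToyCheckpointer`, `eagerDelete`, `materialize` and `scenario` are all hypothetical names introduced for illustration. It assumes that a checkpoint file is written only when an action runs, and that a child whose parent has been checkpointed must read the parent's file.

```scala
import scala.collection.mutable

object CheckpointOrderSketch {
  // Toy stand-in for an RDD: it may depend on a single parent.
  final class Node(val id: Int, val parent: Option[Node]) {
    var marked = false        // a checkpoint was requested
    var materialized = false  // checkpoint file written, lineage truncated
  }

  // Hypothetical simplified checkpointer: every `interval`-th update marks the
  // node for checkpointing. With eagerDelete = true, older checkpoint files are
  // removed immediately in update(); otherwise they are removed only after the
  // newer node has been materialized.
  final class ToyCheckpointer(interval: Int, eagerDelete: Boolean) {
    val files = mutable.Set.empty[Int]  // ids whose checkpoint files exist
    private val queue = mutable.Queue.empty[Node]
    private var updates = 0

    def update(n: Node): Unit = {
      updates += 1
      if (updates % interval == 0) {
        n.marked = true
        queue.enqueue(n)
        if (eagerDelete) cleanup()
      }
    }

    // Models an action: returns false if a required parent file is missing.
    def materialize(n: Node): Boolean = {
      val parentOk = n.parent.forall(p => !p.materialized || files.contains(p.id))
      if (parentOk && n.marked) {
        files += n.id
        n.materialized = true
        if (!eagerDelete) cleanup()
      }
      parentOk
    }

    // Keep only the newest checkpoint in the queue.
    private def cleanup(): Unit =
      while (queue.size > 1) files -= queue.dequeue().id
  }

  // Replays the sequence of updates and actions from the snippet above.
  def scenario(eagerDelete: Boolean): Boolean = {
    val cp = new ToyCheckpointer(2, eagerDelete)
    val rdd1 = new Node(1, None)
    cp.update(rdd1); cp.update(rdd1)
    cp.materialize(rdd1)
    val rdd2 = new Node(2, Some(rdd1))
    cp.update(rdd2); cp.update(rdd2)
    cp.materialize(rdd2)
  }
}
```

In this model `CheckpointOrderSketch.scenario(eagerDelete = true)` fails on the final action because the file of `rdd1` is already gone, while `scenario(eagerDelete = false)` succeeds. Deferring the cleanup until the newer node is materialized is only one possible ordering and is shown here as an assumption.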
    ## How was this patch tested?
    Unit tests

You can merge this pull request into a Git repository by running:

    $ git pull SPARK-22150-early-checkpoints

Alternatively you can review and apply these changes as the patch at:

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19373

commit 0c3338cd645f5824f08fe37fd7174e25c416529b
Author: Sergey Zhemzhitsky <>
Date:   2017-09-27T21:33:18Z

    [SPARK-22150][CORE] preventing too early removal of checkpoints in case of dependant RDDs


