spark-reviews mailing list archives

From szhem <...@git.apache.org>
Subject [GitHub] spark pull request #19373: [SPARK-22150][CORE] PeriodicCheckpointer fails in...
Date Wed, 27 Sep 2017 21:54:59 GMT
GitHub user szhem opened a pull request:

    https://github.com/apache/spark/pull/19373

    [SPARK-22150][CORE] PeriodicCheckpointer fails in case of dependant RDDs

    ## What changes were proposed in this pull request?
    
    Fix for [SPARK-22150](https://issues.apache.org/jira/browse/SPARK-22150) JIRA issue.
    
    When an RDD being checkpointed depends on a previously checkpointed RDD (for example,
in iterative algorithms), PeriodicCheckpointer deletes the checkpoint files of the earlier,
already materialized RDD too early, which leads to FileNotFoundException when the dependent RDD is computed.
    
    Consider the following snippet:
    
        // create a periodic checkpointer with a checkpoint interval of 2 updates
        val checkpointer = new PeriodicRDDCheckpointer[Double](2, sc)

        val rdd1 = createRDD(sc)
        checkpointer.update(rdd1)
        // the second update marks rdd1 for checkpointing
        checkpointer.update(rdd1)
        // the action materializes the checkpoint of rdd1 and truncates its lineage
        rdd1.count()

        // rdd2 depends on rdd1
        val rdd2 = rdd1.filter(_ => true)
        checkpointer.update(rdd2)
        // the second update marks rdd2 for checkpointing and deletes the checkpoint files of rdd1
        checkpointer.update(rdd2)
        // the action has to read the already deleted checkpoint files of rdd1 and fails
        rdd2.count()
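
The failure mode in the snippet above can be reproduced without Spark in a small toy model. The sketch below is only an illustration under stated assumptions: `Node`, `Store`, `ToyCheckpointer`, `compute`, `scenario` and the `waitForNext` flag are hypothetical names invented for this example, and the "wait until the next checkpoint is materialized" policy is one possible way to defer deletion, not necessarily what this patch implements.

```scala
import scala.collection.mutable

object CheckpointModel {
  // Toy stand-in for an RDD (hypothetical, not a Spark class).
  final class Node(val id: Int, val parent: Option[Node] = None) {
    var marked = false        // checkpoint requested, files not yet written
    var materialized = false  // checkpoint files written, lineage truncated
  }

  // Toy stand-in for the checkpoint directory: ids of nodes whose files exist.
  final class Store {
    private val files = mutable.Set.empty[Int]
    def write(n: Node): Unit = files += n.id
    def delete(n: Node): Unit = files -= n.id
    def exists(n: Node): Boolean = files.contains(n.id)
  }

  // Runs an "action" on a node; returns false if a required checkpoint file is gone.
  def compute(n: Node, store: Store): Boolean = {
    val ok =
      if (n.materialized) store.exists(n)           // must read its own checkpoint files
      else n.parent.forall(p => compute(p, store))  // otherwise recompute through the parent
    if (ok && n.marked && !n.materialized) {
      store.write(n)                                // first successful action writes the files
      n.materialized = true
    }
    ok
  }

  // Toy periodic checkpointer. With waitForNext = false, an older checkpoint is
  // deleted as soon as a newer one is merely requested (the eager behaviour).
  // With waitForNext = true (an assumed policy), deletion waits until the newer
  // checkpoint has been materialized.
  final class ToyCheckpointer(interval: Int, store: Store, waitForNext: Boolean) {
    private val queue = mutable.Queue.empty[Node]
    private var updates = 0

    def update(n: Node): Unit = {
      updates += 1
      if (updates % interval == 0) {
        n.marked = true
        queue.enqueue(n)
      }
      def canDrop: Boolean =
        queue.size > 1 && queue.head.materialized &&
          (!waitForNext || queue(1).materialized)
      while (canDrop) store.delete(queue.dequeue())
    }
  }

  // Replays the sequence of calls from the snippet; returns whether the last action succeeds.
  def scenario(waitForNext: Boolean): Boolean = {
    val store = new Store
    val checkpointer = new ToyCheckpointer(2, store, waitForNext)
    val rdd1 = new Node(1)
    checkpointer.update(rdd1)
    checkpointer.update(rdd1)
    compute(rdd1, store)
    val rdd2 = new Node(2, Some(rdd1))
    checkpointer.update(rdd2)
    checkpointer.update(rdd2)
    compute(rdd2, store)
  }
}
```

In this model, `CheckpointModel.scenario(waitForNext = false)` fails on the last action because the files of the first node are already gone, while `CheckpointModel.scenario(waitForNext = true)` succeeds because deletion is postponed until the dependent node has its own checkpoint files.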
    
    ## How was this patch tested?
    
    Unit tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/szhem/spark SPARK-22150-early-checkpoints

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19373.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19373
    
----
commit 0c3338cd645f5824f08fe37fd7174e25c416529b
Author: Sergey Zhemzhitsky <szhemzhitski@gmail.com>
Date:   2017-09-27T21:33:18Z

    [SPARK-22150][CORE] preventing too early removal of checkpoints in case of dependant RDDs

----



