spark-reviews mailing list archives

From ferdonline <>
Subject [GitHub] spark pull request #19805: Adding localCheckpoint to Dataframe API
Date Thu, 23 Nov 2017 19:26:25 GMT
GitHub user ferdonline opened a pull request:

    Adding localCheckpoint to Dataframe API

    ## What changes were proposed in this pull request?
    This change adds local checkpoint support to Datasets, together with the corresponding binding in the Python DataFrame API.
    When reliability requirements can be relaxed in favor of performance, for example when a few quick transformations are followed by a reliable save, localCheckpoint() fits very well.
    Furthermore, reliable checkpoints currently still incur double computation (see #9428).
    In general, this also makes the API more complete.
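The use case described above (a local checkpoint, a few quick transformations, then a reliable save) might look roughly like the sketch below. The helper name `checkpoint_transform_save` and its parameters are hypothetical and not part of this pull request; the sketch only assumes that `df` exposes the `localCheckpoint()` method proposed here, plus the existing `DataFrame.write.parquet(path)` writer.

```python
def checkpoint_transform_save(df, transform, path):
    """Hypothetical helper: local checkpoint, transform, then reliable save.

    Assumes `df` is a PySpark DataFrame exposing `localCheckpoint()`,
    as proposed in this pull request.
    """
    # Truncate the lineage. A local checkpoint keeps its data on the
    # executors only, so it trades fault tolerance for speed.
    checkpointed = df.localCheckpoint()

    # Apply the quick follow-up transformations supplied by the caller.
    result = transform(checkpointed)

    # The reliable step: persist the final result to durable storage.
    result.write.parquet(path)
    return result
```

With a real session this could be called as `checkpoint_transform_save(df2, lambda d: d.filter("id > 0"), "/some/output/path")`, where the filter and the path are placeholders only.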
    ## How was this patch tested?
    Python land quick use case:
    In [1]: from time import sleep
    In [2]: from pyspark.sql import types as T
    In [3]: from pyspark.sql import functions as F
    In [4]: def f(x):
        return x*2
    In [5]: df1 = spark.range(30, numPartitions=6)
    In [6]: df2 = df1.select(F.udf(f, T.LongType())("id"))
    In [7]: %time _ = df2.collect()
    CPU times: user 7.79 ms, sys: 5.84 ms, total: 13.6 ms                           
    Wall time: 12.2 s
    In [8]: %time df3 = df2.localCheckpoint()
    CPU times: user 2.38 ms, sys: 2.3 ms, total: 4.68 ms                            
    Wall time: 10.3 s
    In [9]: %time _ = df3.collect()
    CPU times: user 5.09 ms, sys: 410 µs, total: 5.5 ms
    Wall time: 148 ms

You can merge this pull request into a Git repository by running:

    $ git pull feature_dataset_localCheckpoint

Alternatively you can review and apply these changes as the patch at:

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19805
commit abe03ab0e8d6647ccb8949a39c431cd845c23dbb
Author: Fernando Pereira <>
Date:   2017-11-23T18:49:37Z

    Adding localCheckpoint to Dataframe API


