spark-reviews mailing list archives

From ferdonline <...@git.apache.org>
Subject [GitHub] spark pull request #19805: Adding localCheckpoint to Dataframe API
Date Thu, 23 Nov 2017 19:26:25 GMT
GitHub user ferdonline opened a pull request:

    https://github.com/apache/spark/pull/19805

    Adding localCheckpoint to Dataframe API

    ## What changes were proposed in this pull request?
    
    This change adds local checkpoint support to Datasets, along with the corresponding binding
in the Python DataFrame API.
    
    If reliability requirements can be relaxed in favor of performance, for example when a few
quick transformations are followed by a reliable save, localCheckpoint() fits very well.
    Furthermore, reliable checkpoints currently still incur double computation (see #9428).
    In general, it also makes the API more complete.
    
    ## How was this patch tested?
    
    Python land quick use case:
    
    ```python
    In [1]: from time import sleep
    
    In [2]: from pyspark.sql import types as T
    
    In [3]: from pyspark.sql import functions as F
    
    In [4]: def f(x):
       ...:     sleep(1)
       ...:     return x*2
       ...: 
    
    In [5]: df1 = spark.range(30, numPartitions=6)
    
    In [6]: df2 = df1.select(F.udf(f, T.LongType())("id"))
    
    In [7]: %time _ = df2.collect()
    CPU times: user 7.79 ms, sys: 5.84 ms, total: 13.6 ms                           
    Wall time: 12.2 s
    
    In [8]: %time df3 = df2.localCheckpoint()
    CPU times: user 2.38 ms, sys: 2.3 ms, total: 4.68 ms                            
    Wall time: 10.3 s
    
    In [9]: %time _ = df3.collect()
    CPU times: user 5.09 ms, sys: 410 µs, total: 5.5 ms
    Wall time: 148 ms
    ```
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
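
The use case described above (further quick transformations followed by a reliable save) could be sketched roughly as follows. This is only an illustration, assuming the `localCheckpoint()` method lands as proposed in this pull request; the helper name `transform_then_save`, the `dropDuplicates()` step, and the output path are hypothetical and not part of the patch.

```python
def transform_then_save(df, output_path):
    """Local-checkpoint a DataFrame, apply a cheap follow-up step, then save it.

    `df` is assumed to be a pyspark.sql.DataFrame that exposes the
    localCheckpoint() method proposed in this PR. `output_path` is a
    hypothetical destination chosen by the caller.
    """
    # Materialize the expensive upstream work once and truncate the lineage.
    # Local checkpoints live on executor storage, so they are not fault
    # tolerant: a lost executor means the checkpointed data is lost too.
    checkpointed = df.localCheckpoint()

    # Quick follow-up transformation (illustrative choice) that reads the
    # checkpointed data instead of recomputing the upstream plan.
    result = checkpointed.dropDuplicates()

    # The reliable save at the end is what provides durability.
    result.write.mode("overwrite").parquet(output_path)
    return result
```

With a live session, this could be called as `transform_then_save(df2, "/tmp/out.parquet")`, where the path is a placeholder. The trade-off is the one stated in the description: reliability is lowered for the intermediate result, and only the final write is durable.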


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ferdonline/spark feature_dataset_localCheckpoint

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19805.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19805
    
----
commit abe03ab0e8d6647ccb8949a39c431cd845c23dbb
Author: Fernando Pereira <fernando.pereira@epfl.ch>
Date:   2017-11-23T18:49:37Z

    Adding localCheckpoint to Dataframe API

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

