spark-reviews mailing list archives

From thunterdb <...@git.apache.org>
Subject [GitHub] spark pull request: [SPARK-7264][ML] Parallel lapply for sparkR
Date Fri, 15 Apr 2016 21:07:57 GMT
GitHub user thunterdb opened a pull request:

    https://github.com/apache/spark/pull/12426

    [SPARK-7264][ML] Parallel lapply for sparkR

    ## What changes were proposed in this pull request?
    
    This PR adds a new function to SparkR, `sparkLapply(list, function)`, which implements a distributed version of `lapply` using Spark as the backend.
    
    Trivial example in SparkR:
    
    ```R
    sparkLapply(1:5, function(x) { 2 * x })
    ```
    
    Output:
    
    ```
    [[1]]
    [1] 2
    
    [[2]]
    [1] 4
    
    [[3]]
    [1] 6
    
    [[4]]
    [1] 8
    
    [[5]]
    [1] 10
    ```
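    
    For comparison, a minimal local sketch using base R `lapply` (no Spark involved). It assumes, based on the output above, that `sparkLapply` returns a plain list in the same order as its input; the variable name `local_result` is only illustrative.
    
    ```R
    # Base R lapply, evaluated locally in a single process
    local_result <- lapply(1:5, function(x) { 2 * x })
    
    # Flatten the list for a quick look at the values
    unlist(local_result)
    # [1]  2  4  6  8 10
    ```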
    
    Here is a slightly more complex example that trains multiple models in a distributed fashion. Under the hood, Spark broadcasts the dataset.
    
    ```R
    library("MASS")  # provides the menarche dataset
    data(menarche)
    # Fit one GLM per error family
    families <- c("gaussian", "poisson")
    train <- function(family) { glm(Menarche ~ Age, family = family, data = menarche) }
    results <- sparkLapply(families, train)
    ```
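    
    One possible way to sanity-check the distributed run is to fit the same models locally with base R `lapply` and inspect them. This is only a sketch: `local_results` is an illustrative name, and comparing against `results` assumes `sparkLapply` returns the fitted `glm` objects as a list.
    
    ```R
    library("MASS")  # provides the menarche dataset
    data(menarche)
    families <- c("gaussian", "poisson")
    train <- function(family) { glm(Menarche ~ Age, family = family, data = menarche) }
    
    # Local, single-process baseline with base R lapply
    local_results <- lapply(families, train)
    names(local_results) <- families
    
    # Extract the fitted Age coefficient for each family
    sapply(local_results, function(model) coef(model)[["Age"]])
    ```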
    
    ## How was this patch tested?
    
    This PR was tested in SparkR. I am unfamiliar with R and SparkR, so any feedback on style, testing, etc. will be much appreciated.
    
    cc @falaki @davies 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/thunterdb/spark 7264

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12426.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12426
    
----
commit bd73c5b1dc4f5448806d910d2267f4ef6fe5d9eb
Author: Timothy Hunter <timhunter@databricks.com>
Date:   2016-04-12T18:38:57Z

    work

commit 0ca109408005d7ed8971e583fec8027ef7c34c9e
Author: Timothy Hunter <timhunter@databricks.com>
Date:   2016-04-12T18:56:58Z

    documentation and fixes

commit 0643df2f792ef1d3f0853e6c1a11e17e2301e5da
Author: Timothy Hunter <timhunter@databricks.com>
Date:   2016-04-12T20:17:20Z

    style issue

commit 0299d8ba399a299eac6289946d940959a392d070
Author: Timothy Hunter <timhunter@databricks.com>
Date:   2016-04-13T14:23:04Z

    comments addressed

commit 745a10385314adc0fc3597802217d9cfe403786c
Author: Timothy Hunter <timhunter@databricks.com>
Date:   2016-04-13T14:31:48Z

    jsonify the other parameters

commit a824d9073497853fc8c9f200abceb51801fbeb8e
Author: Timothy Hunter <timhunter@databricks.com>
Date:   2016-04-13T14:38:10Z

    style

commit cc8626434c7872b93d1876cccd1d85d99ba43805
Author: Timothy Hunter <timhunter@databricks.com>
Date:   2016-04-15T20:49:17Z

    initial commit

commit 1df83cb3001f5397babdbd68bc4e417a0cdcea53
Author: Timothy Hunter <timhunter@databricks.com>
Date:   2016-04-15T21:01:13Z

    not unlisting

----


