GitHub user thunterdb opened a pull request:
https://github.com/apache/spark/pull/12426
[SPARK-7264][ML] Parallel lapply for sparkR
## What changes were proposed in this pull request?
This PR adds a new function in SparkR called `sparkLapply(list, function)`. This function
implements a distributed version of `lapply` using Spark as a backend.
Trivial example in SparkR:
```R
sparkLapply(1:5, function(x) { 2 * x })
```
Output:
```
[[1]]
[1] 2
[[2]]
[1] 4
[[3]]
[1] 6
[[4]]
[1] 8
[[5]]
[1] 10
```
Here is a slightly more complex example that trains multiple models in a distributed fashion,
one model per element of the input list. Under the hood, Spark broadcasts the dataset
(`menarche`, which `train` references) to the workers.
```R
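library("MASS")
data(menarche)
# One GLM is fitted per family name.
families <- c("gaussian", "poisson")
train <- function(family) {
  glm(Menarche ~ Age, family = family, data = menarche)
}
# Each call to train() runs on a Spark worker; the results come back as a list.
results <- sparkLapply(families, train)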
```
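For comparison, the same models can be fitted locally with base R's `lapply`. This is only a sketch of the intended semantics, not part of the patch: it assumes that `sparkLapply` returns one result per input element, in input order, as `lapply` does. The name `local_results` is introduced here for illustration.

```R
library("MASS")
data(menarche)

families <- c("gaussian", "poisson")
train <- function(family) {
  glm(Menarche ~ Age, family = family, data = menarche)
}

# Local, single-process equivalent of the sparkLapply call
# (assumption: sparkLapply mirrors lapply's return shape).
local_results <- lapply(families, train)

# One fitted glm object per family, in the order given.
length(local_results)
class(local_results[[1]])
```

Comparing `coef(local_results[[i]])` with the coefficients of the distributed results would be one simple way to check that both paths agree.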
## How was this patch tested?
This PR was tested in SparkR. I am unfamiliar with R and SparkR, so any feedback on style,
testing, etc. will be much appreciated.
cc @falaki @davies
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/thunterdb/spark 7264
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/12426.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #12426
----
commit bd73c5b1dc4f5448806d910d2267f4ef6fe5d9eb
Author: Timothy Hunter <timhunter@databricks.com>
Date: 2016-04-12T18:38:57Z
work
commit 0ca109408005d7ed8971e583fec8027ef7c34c9e
Author: Timothy Hunter <timhunter@databricks.com>
Date: 2016-04-12T18:56:58Z
documentation and fixes
commit 0643df2f792ef1d3f0853e6c1a11e17e2301e5da
Author: Timothy Hunter <timhunter@databricks.com>
Date: 2016-04-12T20:17:20Z
style issue
commit 0299d8ba399a299eac6289946d940959a392d070
Author: Timothy Hunter <timhunter@databricks.com>
Date: 2016-04-13T14:23:04Z
comments addressed
commit 745a10385314adc0fc3597802217d9cfe403786c
Author: Timothy Hunter <timhunter@databricks.com>
Date: 2016-04-13T14:31:48Z
jsonify the other parameters
commit a824d9073497853fc8c9f200abceb51801fbeb8e
Author: Timothy Hunter <timhunter@databricks.com>
Date: 2016-04-13T14:38:10Z
style
commit cc8626434c7872b93d1876cccd1d85d99ba43805
Author: Timothy Hunter <timhunter@databricks.com>
Date: 2016-04-15T20:49:17Z
initial commit
commit 1df83cb3001f5397babdbd68bc4e417a0cdcea53
Author: Timothy Hunter <timhunter@databricks.com>
Date: 2016-04-15T21:01:13Z
not unlisting
----