spark-issues mailing list archives

From "Yanbo Liang (JIRA)" <>
Subject [jira] [Commented] (SPARK-17904) Add a wrapper function to install R packages on each executors.
Date Fri, 14 Oct 2016 14:48:20 GMT


Yanbo Liang commented on SPARK-17904:

[~felixcheung] I think the proposal I made in this JIRA covers a different scenario from
the one you mentioned. R users may call install.packages() at any point during a session,
rather than installing all necessary libraries before the session starts. From the discussion,
I see that this feature is not easy to support. As [~shivaram] suggested, we can
try to add the required packages when the session starts. This would satisfy some users' requirements,
but not all.
Thanks for all your comments.

> Add a wrapper function to install R packages on each executors.
> ---------------------------------------------------------------
>                 Key: SPARK-17904
>                 URL:
>             Project: Spark
>          Issue Type: New Feature
>          Components: SparkR
>            Reporter: Yanbo Liang
> SparkR provides {{spark.lapply}} to run local R functions in a distributed environment,
and {{dapply}} to run UDFs on a SparkDataFrame.
> If users call third-party libraries inside the function passed to {{spark.lapply}}
or {{dapply}}, they must install the required R packages on each executor in advance.
> To install the dependent R packages on each executor and check that the installation succeeded, we can run
code similar to the following:
> (Note: This code is only an example, not a prototype of this proposal. The detailed implementation
should be discussed.)
> {code}
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L), function(part) { install.packages("Matrix") })
> test <- function(x) { "Matrix" %in% rownames(installed.packages()) }
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L), test)
> SparkR:::collectRDD(rdd)
> {code}
> It's cumbersome to run this code snippet every time a third-party library is needed.
Because SparkR is an interactive analytics tool, users may load many libraries during an
analytics session. In native R, users can run {{install.packages()}} and {{library()}} at any point in
the interactive session.
> Should we provide an API that wraps the work described above, so that users can easily install
dependent R packages on each executor?
> I propose the following API:
> {{spark.installPackages(pkgs, repos)}}
> * pkgs: the names of the packages. If repos = NULL, this can be set to a local/HDFS path,
and SparkR can then install the packages from local package archives.
> * repos: the base URL(s) of the repositories to use. It can be NULL to install from local
> Since SparkR has its own library directories in which to install packages on each executor,
I think this will not pollute the native R environment. I'd like to know whether this makes
sense; feel free to correct me if I have misunderstood anything.
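
To make the proposed {{spark.installPackages(pkgs, repos)}} more concrete, here is a rough sketch of how such a wrapper might be built on the same internal calls as the example above. Everything specific in it is an assumption for illustration only: the helper name install_missing, the extra sc and numSlices arguments, and the default CRAN mirror URL are not part of SparkR or of this proposal. It also assumes that SparkR ships the helper referenced by the closure to the workers.

```r
# Hypothetical helper (not part of SparkR): installs whichever of `pkgs`
# are missing from the local R library, then reports whether all of them
# are available afterwards.
install_missing <- function(pkgs, repos = "https://cloud.r-project.org") {
  missing <- setdiff(pkgs, rownames(installed.packages()))
  if (length(missing) > 0) {
    install.packages(missing, repos = repos)
  }
  all(pkgs %in% rownames(installed.packages()))
}

# Hypothetical wrapper following the proposed signature, with two assumed
# extra arguments: `sc` (the Spark context) and `numSlices` (how many
# partitions, and therefore tasks, to launch).
spark.installPackages <- function(sc, pkgs,
                                  repos = "https://cloud.r-project.org",
                                  numSlices = 2L) {
  rdd <- SparkR:::parallelize(sc, seq_len(numSlices), numSlices)
  # Each task installs the packages where it runs and returns one flag.
  status <- SparkR:::lapplyPartition(rdd, function(part) {
    list(install_missing(pkgs, repos))
  })
  # One TRUE/FALSE per partition, collected back on the driver.
  unlist(SparkR:::collectRDD(status))
}
```

A call might look like spark.installPackages(sc, "Matrix"). Note that tasks are scheduled per partition, not per executor, so a sketch like this cannot by itself guarantee that every executor runs the installation; that is one of the details the real implementation would need to settle.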
