spark-issues mailing list archives

From "Sun Rui (JIRA)" <>
Subject [jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv
Date Fri, 09 Sep 2016 01:49:20 GMT


Sun Rui commented on SPARK-17428:

I don't understand what "exact version control" means here. I think a user can either supply
already-downloaded R packages or specify a package name and version, and SparkR can then
download the package from CRAN.
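
The two ways of specifying a package mentioned above could be modelled roughly as in the following Python sketch. The names RPackageSpec, name, version, and local_path are hypothetical and chosen only for illustration; nothing here is an existing SparkR API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RPackageSpec:
    """Hypothetical description of one R package dependency."""
    name: Optional[str] = None        # CRAN package name (assumed field)
    version: Optional[str] = None     # version to download from CRAN (assumed field)
    local_path: Optional[str] = None  # already-downloaded package file (assumed field)

    def __post_init__(self):
        has_local = self.local_path is not None
        has_cran = self.name is not None and self.version is not None
        # Exactly one of the two forms must be given.
        if has_local == has_cran:
            raise ValueError("give either local_path or name plus version")

    @property
    def source(self) -> str:
        # "local" for a downloaded package, "cran" for a name/version pair.
        return "local" if self.local_path is not None else "cran"
```

A spec built from a name and version would report "cran" as its source, and one built from a file path would report "local".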

PySpark does not have the compilation issue, as Python code needs no compilation. The Python
interpreter abstracts the underlying architecture differences, just as the JVM does.

For the R package compilation issue, maybe we can have the following policies:
1. For binary R packages, just deliver them to worker nodes;
2. For source R packages:
  2.1 if only R code is contained, compilation on the driver node is OK
  2.2 if C/C++ code is contained, by default, compile it on the driver node. But we can have
an option --compile-on-workers allowing users to choose to compile on worker nodes. If the
option is specified, users should ensure the compiling tool chain is ready on worker nodes.
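
The policy above can be sketched as a small decision function. This is only an illustration in Python under stated assumptions: the function name compile_location, its parameters, and the returned labels are hypothetical, and compile_on_workers stands in for the proposed --compile-on-workers option, which does not exist in SparkR today.

```python
def compile_location(is_binary: bool,
                     has_native_code: bool,
                     compile_on_workers: bool = False) -> str:
    """Return where a package would be compiled under the proposed policy.

    Returns "none" for binary packages (delivered to workers as-is),
    otherwise "driver" or "workers".
    """
    if is_binary:
        # Policy 1: binary packages are just delivered to worker nodes.
        return "none"
    if not has_native_code:
        # Policy 2.1: source packages with only R code compile on the driver.
        return "driver"
    # Policy 2.2: C/C++ code compiles on the driver by default, or on the
    # workers when the user opts in (and provides the tool chain there).
    return "workers" if compile_on_workers else "driver"
```

Note that under this reading the option only affects source packages containing C/C++ code; pure R packages still compile on the driver.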

> SparkR executors/workers support virtualenv
> -------------------------------------------
>                 Key: SPARK-17428
>                 URL:
>             Project: Spark
>          Issue Type: New Feature
>          Components: SparkR
>            Reporter: Yanbo Liang
> Many users need to use third-party R packages in executors/workers, but
SparkR cannot satisfy this requirement elegantly. For example, you have to ask the
IT/administrators of the cluster to deploy these R packages on each executor/worker node,
which is very inflexible.
> I think we should support third-party R packages for SparkR users, as we do for jar
packages, in the following two scenarios:
> 1. Users can install R packages from CRAN or a custom CRAN-like repository on each executor.
> 2. Users can load their local R packages and install them on each executor.
> To achieve this goal, the first thing is to make SparkR executors support virtualenv
like Python conda. I have investigated and found packrat(
is one of the candidates to support virtualenv for R. Packrat is a dependency management system
for R and can isolate the dependent R packages in its own private package space. SparkR
users can then install third-party packages in the application scope (destroyed when the
application exits) and don’t need to bother IT/administrators to install these packages manually.
> I would like to know whether this makes sense.
