spark-issues mailing list archives

From "Jeff Zhang (JIRA)" <>
Subject [jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
Date Wed, 14 Dec 2016 05:30:58 GMT


Jeff Zhang commented on SPARK-13587:

I don't understand how this can work without Python installed on the worker nodes. And regarding the overhead: whether we distribute the dependencies or download them, the overhead cannot be avoided. From my experience, one advantage of the downloading approach is that it caches the dependencies on the worker node. So if the executor runs on a node where the dependencies are already cached, it saves a lot of time when setting up the virtualenv.
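The caching idea above could be sketched roughly as follows (a hypothetical illustration, not any existing Spark API: the function names, cache location, and layout are my own assumptions). The virtualenv is keyed by a hash of its dependency list, so an executor that lands on a node which has already built that environment reuses it instead of rebuilding:

```python
import hashlib
import os
import subprocess
import sys

CACHE_ROOT = "/tmp/pyspark_venv_cache"  # hypothetical per-node cache location


def venv_cache_dir(requirements_text, cache_root=CACHE_ROOT):
    # Key the env by the hash of its dependency list: identical requirements
    # on the same worker map to the same cached directory.
    key = hashlib.sha256(requirements_text.encode("utf-8")).hexdigest()[:16]
    return os.path.join(cache_root, key)


def ensure_virtualenv(requirements_path, cache_root=CACHE_ROOT):
    # On a cache hit, the slow create + pip-install steps are skipped entirely.
    with open(requirements_path, "r", encoding="utf-8") as f:
        env_dir = venv_cache_dir(f.read(), cache_root)
    if not os.path.isdir(env_dir):  # cache miss: build the env once per node
        subprocess.check_call([sys.executable, "-m", "venv", env_dir])
        pip = os.path.join(env_dir, "bin", "pip")
        subprocess.check_call([pip, "install", "-r", requirements_path])
    return env_dir
```

The keying step is what makes the cache safe to share: two jobs with identical requirements reuse one environment, while any change to the dependency list produces a fresh directory.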

> Support virtualenv in PySpark
> -----------------------------
>                 Key: SPARK-13587
>                 URL:
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark
>            Reporter: Jeff Zhang
> Currently, it's not easy for users to add third-party Python packages in PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not for
complicated dependencies, especially those with transitive dependencies)
> * Another way is to install packages manually on each node (time-consuming, and not easy to
switch between different environments)
> Python now has 2 different virtualenv implementations: one is the native virtualenv, the other
is through conda. This JIRA is about bringing these 2 tools to the distributed environment.
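As a concrete illustration of the --py-files route described above (a minimal sketch; the module and job names are hypothetical): pure-Python dependencies can be zipped and shipped alongside the job. This is exactly why it only covers the simple cases, since packages with native extensions or deep transitive dependencies are what the virtualenv/conda approach in this JIRA would handle.

```python
import os
import zipfile


def bundle_py_deps(module_paths, out_zip="deps.zip"):
    # Zip up pure-Python modules so the archive can be shipped with the job via
    # --py-files (or sc.addPyFile at runtime) and land on every executor's path.
    with zipfile.ZipFile(out_zip, "w") as zf:
        for path in module_paths:
            zf.write(path, arcname=os.path.basename(path))
    return out_zip


# Usage (hypothetical module and job names):
#   bundle_py_deps(["mylib.py"])
#   spark-submit --py-files deps.zip my_job.py
```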

This message was sent by Atlassian JIRA

