spark-issues mailing list archives

From Fabian Höring (JIRA) <j...@apache.org>
Subject [jira] [Updated] (SPARK-25433) Add support for PEX in PySpark
Date Mon, 17 Sep 2018 11:37:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fabian Höring updated SPARK-25433:
----------------------------------
    Description: 
The goal of this ticket is to ship and use custom code inside the Spark executors.

This currently works fine with [conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]:

Basically, the workflow is to:
 * zip the local conda environment ([conda pack|https://github.com/conda/conda-pack] also works)
 * ship it to each executor as an archive
 * point PYSPARK_PYTHON at the Python interpreter inside the unpacked conda environment
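
The three steps above can be sketched as a small helper that assembles the spark-submit command line. This is only a sketch under assumptions: it targets YARN in cluster mode, and the archive name conda_env.tar.gz, the alias environment and the application my_job.py are hypothetical placeholders. The --archives option and the spark.yarn.appMasterEnv.* / spark.executorEnv.* settings are standard Spark configuration.

```python
import shlex


def conda_submit_command(archive, app, alias="environment"):
    """Build a spark-submit command that ships a packed conda env to YARN.

    `archive` is the zipped/packed conda environment, `alias` is the directory
    name under which YARN unpacks it in each container's working directory.
    All three names are placeholders chosen for this example.
    """
    # Interpreter path relative to the container working directory.
    python = f"./{alias}/bin/python"
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        # Ship the archive to every container and unpack it as `alias`.
        "--archives", f"{archive}#{alias}",
        # Use the shipped interpreter for the driver (application master) ...
        "--conf", f"spark.yarn.appMasterEnv.PYSPARK_PYTHON={python}",
        # ... and for the Python workers started by the executors.
        "--conf", f"spark.executorEnv.PYSPARK_PYTHON={python}",
        app,
    ]


if __name__ == "__main__":
    cmd = conda_submit_command("conda_env.tar.gz", "my_job.py")
    print(" ".join(shlex.quote(part) for part in cmd))
```

The helper only builds the argument list; running it against a real cluster still requires a packed environment whose Python version and platform match the executors.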

I think it can work the same way with a virtual env. The SPARK-13587 ticket proposes nice
entry points for spark-submit and SparkContext, but zipping your local virtual env and
then just changing PYSPARK_PYTHON should already work.

I have also seen this [blog post|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
But recreating the virtual env each time doesn't seem to be a very scalable solution. If you
have hundreds of executors, every executor will retrieve the packages and recreate your
virtual environment each time. As far as I understand, the SPARK-16367 proposal has the same problem.

Another problem with virtual env is that your local environment is not easily shippable to
another machine. In particular, the relocatable option (see [https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable],
[https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work])
makes it very complicated for the user to ship the virtual env and be sure it works.

This is where [pex|https://github.com/pantsbuild/pex] comes in. It is a nice way to create
a single executable zip file with all dependencies included. You use the pex command line
tool to build your package, and once it is built you can be confident it works. This is, in my
opinion, the most elegant way to ship Python code (better than virtual env and conda).
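
To illustrate what "a single executable zip file" means, here is a minimal sketch that uses Python's standard zipapp module, not pex itself. As far as I know pex relies on the same interpreter feature (a zip archive containing a __main__.py), but unlike this example it also resolves and bundles third-party dependencies. The function name and the message are made up for the example.

```python
import pathlib
import subprocess
import sys
import tempfile
import zipapp


def build_and_run_zipapp(message):
    """Pack a tiny program into one zip archive and run it.

    Returns whatever the archived program printed to stdout.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = pathlib.Path(tmp) / "app"
        src.mkdir()
        # __main__.py is the single entry point of the archive.
        (src / "__main__.py").write_text(f"print({message!r})\n")

        target = pathlib.Path(tmp) / "app.pyz"
        zipapp.create_archive(src, target)

        # The interpreter can execute the zip file directly.
        result = subprocess.run(
            [sys.executable, str(target)],
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.strip()


if __name__ == "__main__":
    print(build_and_run_zipapp("hello from a single-file archive"))
```

Note that the archive has exactly one entry point, its __main__.py, which is the same constraint discussed for pex files in this ticket.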

The reason it doesn't work out of the box is that a pex file can have only one entry
point. So just shipping the pex file and setting PYSPARK_PYTHON to the pex file doesn't
work. You can nevertheless set the env variable [PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables]
at runtime to provide different entry points.
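
A minimal sketch of selecting the entry point at runtime: PEX_MODULE is the variable documented by pex, while the helper name pex_env and the pex file path are hypothetical. The module pyspark.daemon is used only as an example of a module one might want to start; this is not necessarily how the linked PR does it.

```python
import os


def pex_env(module, base_env=None):
    """Return a copy of the environment with PEX_MODULE set to `module`.

    pex reads PEX_MODULE when the pex file starts and uses it as the
    entry point instead of the one chosen at build time.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["PEX_MODULE"] = module
    return env


if __name__ == "__main__":
    # Show the variable that would be passed to the pex process.
    print(pex_env("pyspark.daemon", {})["PEX_MODULE"])
```

A caller would then start the same pex file with different entry points, for example with subprocess.run(["./my_env.pex"], env=pex_env("pyspark.daemon")), where ./my_env.pex is a placeholder path.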

PR: [https://github.com/apache/spark/pull/22422/files]

 

 

 


> Add support for PEX in PySpark
> ------------------------------
>
>                 Key: SPARK-25433
>                 URL: https://issues.apache.org/jira/browse/SPARK-25433
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 2.2.2
>            Reporter: Fabian Höring
>            Priority: Minor



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

