spark-issues mailing list archives

From "Semet (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-16367) Wheelhouse Support for PySpark
Date Mon, 04 Jul 2016 15:38:11 GMT

     [ https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Semet updated SPARK-16367:
--------------------------
    Description: 
*Rationale*
The recommended way to deploy Spark packages written in Scala is to build a big fat jar file. This bundles all dependencies into a single package, so the only "cost" is the time needed to copy that file to every Spark node.

Python deployment, on the other hand, becomes more difficult as soon as you want to use external packages, since you don't really want to ask IT to install the packages in the virtualenv of each node.

*Previous approaches*
I based the current proposal on the two following issues related to this point:
- SPARK-6764 ("Wheel support for PySpark")
- SPARK-13587 ("Support virtualenv in PySpark")

So here is my proposal:

*Uber Fat Wheelhouse for Python Deployment*
In Python, the packaging standard is now the "wheel", which goes further than the good old ".egg" files. With a wheel file (".whl"), the package is already built for a given architecture. You can have several wheels, each specific to an architecture or environment.

The {{pip}} tool knows how to select the wheel matching the current system and how to install it very quickly. In other words, a package that requires compilation of a C module, for instance, does *not* compile anything when installed from a wheel file.
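For illustration, the target platform is encoded directly in the wheel filename ({{<name>-<version>-<python tag>-<abi tag>-<platform tag>.whl}}); the filenames below are only examples:
{code}
numpy-1.11.1-cp27-cp27mu-manylinux1_x86_64.whl   <- CPython 2.7, 64-bit Linux
numpy-1.11.1-cp27-cp27m-win_amd64.whl            <- CPython 2.7, 64-bit Windows
my_package-1.0.0-py2-none-any.whl                <- pure Python, any platform
{code}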

{{pip}} also makes it easy to generate the wheels of all the packages used by a given project (inside a "virtualenv"). The resulting collection of wheels is called a "wheelhouse". You can even skip the compilation entirely and retrieve pre-built wheels directly from pypi.python.org.
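For example (a sketch, assuming a pinned {{requirements.txt}} is present), the whole wheelhouse can be produced with one command; {{pip}} reuses pre-built wheels from PyPI when they exist and builds the remaining ones locally:
{code}
pip wheel -r requirements.txt --wheel-dir ./wheelhouse
{code}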

*Developer workflow*
Here is, more concretely, my proposal from the PySpark developer's point of view:

- you are writing a PySpark script that keeps growing in size and dependencies. Deploying it on Spark, for example, requires building numpy or Theano and other dependencies on every node
- to use the "Big Fat Wheelhouse" support of PySpark, you need to turn your script into a standard Python package:
-- write a {{requirements.txt}}. I recommend pinning every package version. You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the {{requirements.txt}}:
{code}
astroid==1.4.6            # via pylint
autopep8==1.2.4
click==6.6                # via pip-tools
colorama==0.3.7           # via pylint
enum34==1.1.6             # via hypothesis
findspark==1.0.0          # via spark-testing-base
first==2.0.1              # via pip-tools
hypothesis==3.4.0         # via spark-testing-base
lazy-object-proxy==1.2.2  # via astroid
linecache2==1.0.0         # via traceback2
pbr==1.10.0
pep8==1.7.0               # via autopep8
pip-tools==1.6.5
py==1.4.31                # via pytest
pyflakes==1.2.3
pylint==1.5.6
pytest==2.9.2             # via spark-testing-base
six==1.10.0               # via astroid, pip-tools, pylint, unittest2
spark-testing-base==0.0.7.post2
traceback2==1.4.0         # via unittest2
unittest2==1.1.0          # via spark-testing-base
wheel==0.29.0
wrapt==1.10.8             # via astroid
{code}
-- write a {{setup.py}} declaring your entry points and packages. Use [PBR|http://docs.openstack.org/developer/pbr/]; it makes the job of maintaining a {{setup.py}} file really easy (see the sketch after this list)
-- create a virtualenv if not already in one:
{code}
virtualenv env
{code}
-- work in your environment, declare the requirements you need in {{requirements.txt}}, and run all the {{pip install}} commands you need
- create the wheelhouse for your current project:
{code}
pip install wheel
pip wheel . --wheel-dir wheelhouse
{code}
This can take some time, but in the end you have all the .whl files required *for your current system*
- zip it into a {{wheelhouse.zip}}.
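As a minimal sketch of the packaging step mentioned above (the package name and entry point are only illustrative), a PBR-based {{setup.py}} stays trivial:
{code}
# setup.py -- PBR reads the actual metadata from setup.cfg
import setuptools

setuptools.setup(
    setup_requires=['pbr'],
    pbr=True,
)
{code}
The package name, packages, and console script entry points then live in {{setup.cfg}} under the usual PBR sections ({{[metadata]}}, {{[files]}}, {{[entry_points]}}).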

Note that your own package (for instance 'my_package') can also be built as a wheel and thus installed automatically by {{pip}}.
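For example (a sketch; paths are assumptions), building the project's own wheel together with all its requirements and archiving the result could look like:
{code}
# build the current project and every pinned requirement into ./wheelhouse
pip wheel . -r requirements.txt --wheel-dir ./wheelhouse
# archive the wheelhouse so it can be shipped by spark-submit
cd wheelhouse && zip -r ../wheelhouse.zip . && cd ..
{code}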

Now comes the time to submit the project:
{code}
bin/spark-submit --master master --deploy-mode client \
  --conf "spark.pyspark.virtualenv.enabled=true" \
  --conf "spark.pyspark.virtualenv.type=native" \
  --conf "spark.pyspark.virtualenv.requirements=/path/to/virtualenv/requirements.txt" \
  --conf "spark.pyspark.virtualenv.bin.path=virtualenv" \
  --conf "spark.pyspark.virtualenv.wheelhouse=/path/to/virtualenv/wheelhouse.zip" \
  ~/path/to/launcher_script.py
{code}

You can see that:
- no extra argument is added to the command line; all configuration goes through {{--conf}} arguments (this has been taken directly from SPARK-13587). Judging from the history of the Spark source code, I guess the goal is to simplify the maintenance of the various command line interfaces by avoiding too many specific arguments
- the command line is indeed pretty complex; I guess with proper documentation this should not be a problem
- you still need to provide the paths to {{requirements.txt}} and {{wheelhouse.zip}} (they will be automatically copied to each node). This is important since it allows {{pip install}}, running on each node, to pick only the wheels it needs. For example, if you have a package compiled for both 32 bits and 64 bits, you will have two wheels, and on each node {{pip}} will select only the right one
- I have chosen to keep the script at the end of the command line, but to me it is just a launcher script; it can be as short as 4 lines (a fuller, illustrative sketch of {{mypackage}} follows this list):
{code}
#!/usr/bin/env python

from mypackage import run
run()
{code}
- on each node, a new virtualenv is created *at each deployment*. This has a cost, but not a big one, since {{pip install}} will only install wheels; no compilation nor internet connection will be required. The command for installing the wheels on each node will look like:
{code}
pip install --no-index --find-links=/path/to/node/wheelhouse -r requirements.txt
{code}
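For illustration only, the {{run()}} function imported by the launcher above is ordinary PySpark code living inside the wheel-installed package ({{mypackage}} and its content are hypothetical):
{code}
# mypackage/__init__.py (hypothetical example)
from pyspark import SparkContext


def run():
    # dependencies installed from the wheelhouse (numpy, ...) can be
    # imported here as usual
    sc = SparkContext(appName="wheelhouse-demo")
    total = sc.parallelize(range(1000)).map(lambda x: x * x).sum()
    print(total)
    sc.stop()
{code}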

*Advantages*
- quick installation, since there is no compilation
- no Internet connectivity needed; no need to mess with the corporate proxy or set up a local mirror of PyPI
- package version isolation (two Spark jobs can depend on two different versions of a given library)

*Disadvantages*
- creating a virtualenv at each execution takes time; not that much, but still a few seconds
- it also consumes some disk space
- slightly more complex to set up than sending a simple Python script, but that feature is not lost
- supporting heterogeneous Spark nodes (e.g., 32-bit and 64-bit) is possible, but one has to ship all wheel flavours and ensure pip can install in every environment. The complexity of this task is in the hands of the developer and no longer of the IT staff! (IMHO, this is an advantage)
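To make the last point concrete, a wheelhouse shipped to a mixed cluster simply contains several flavours of the same compiled packages (filenames are illustrative), and {{pip}} on each node installs only the one matching its platform:
{code}
wheelhouse/
    my_package-1.0.0-py2-none-any.whl
    numpy-1.11.1-cp27-cp27mu-manylinux1_x86_64.whl   <- picked on 64-bit nodes
    numpy-1.11.1-cp27-cp27mu-manylinux1_i686.whl     <- picked on 32-bit nodes
{code}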

*Code submission*
I have already started working on this, beginning by merging the two pull requests [#5408|https://github.com/apache/spark/pull/5408] and [#13599|https://github.com/apache/spark/pull/13599].
I'll upload a patch ASAP for review.
I see two major open questions:
- I don't know YARN or Mesos that well, so I might need some help for the final integration
- the documentation should be carefully crafted so users are not lost in all these concepts

I really think having this "wheelhouse" support will help with using, maintaining, and evolving Python scripts on Spark. Python has a rich set of mature libraries, and Spark should do everything it can to help developers easily access and use them in their everyday jobs.


> Wheelhouse Support for PySpark
> ------------------------------
>
>                 Key: SPARK-16367
>                 URL: https://issues.apache.org/jira/browse/SPARK-16367
>             Project: Spark
>          Issue Type: Improvement
>          Components: Deploy, PySpark
>    Affects Versions: 1.6.1, 1.6.2, 2.0.0
>            Reporter: Semet
>              Labels: newbie
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


