spark-dev mailing list archives

From Olivier Girardot <>
Subject Re: PySpark on PyPi
Date Fri, 05 Jun 2015 22:46:49 GMT
Ok, I get it. Now what can we do to improve the current situation? Right now,
if I want to set up a CI environment for PySpark, I have to:
1- download a pre-built Spark distribution and unzip it somewhere on every agent
2- define the SPARK_HOME environment variable
3- symlink the distribution's python/pyspark directory into the Python
install's site-packages/ directory
and if I rely on additional packages (like Databricks' spark-csv project),
I have to (unless I'm mistaken):
4- compile/assemble spark-csv and deploy the jar to a specific directory on
every agent
5- add this jar-filled directory to the Spark distribution's extra
classpath using the conf/spark-defaults.conf file

Then finally we can launch our unit/integration-tests.
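Steps 2 and 3 above can be approximated without any symlinking by a small bootstrap run at the start of the CI job. This is only a sketch; the /opt/spark path is an assumption for illustration, not part of any Spark release:

```python
# Hypothetical CI bootstrap mirroring steps 2-3: point at an unpacked Spark
# distribution and put its Python sources on sys.path instead of symlinking
# them into site-packages. The default path below is an assumption.
import glob
import os
import sys

SPARK_HOME = os.environ.setdefault("SPARK_HOME", "/opt/spark")  # step 2
sys.path.insert(0, os.path.join(SPARK_HOME, "python"))          # step 3
# Py4J ships inside the distribution as a zip under python/lib/.
sys.path.extend(glob.glob(os.path.join(SPARK_HOME, "python", "lib", "py4j-*.zip")))
```

With this at the top of a test runner's conftest or entry point, `import pyspark` resolves against the unpacked distribution rather than a copy in site-packages.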
Some issues are related to spark-packages, some to the lack of Python-based
dependency management, and some to the way a SparkContext is launched.
I think steps 1 and 2 are fair enough.
Steps 4 and 5 may already have solutions (I didn't check), and considering
that spark-shell downloads such dependencies automatically, I think PySpark
will too if it doesn't already (I guess?).

For step 3, maybe just adding a setup.py to the distribution would be
enough. I'm not exactly advocating distributing a full 300MB Spark
distribution on PyPi; maybe there's a better compromise?



On Fri, Jun 5, 2015 at 10:12 PM, Jey Kottalam <> wrote:

> Couldn't we have a pip installable "pyspark" package that just serves as a
> shim to an existing Spark installation? Or it could even download the
> latest Spark binary if SPARK_HOME isn't set during installation. Right now,
> Spark doesn't play very well with the usual Python ecosystem. For example,
> why do I need to use a strange incantation when booting up IPython if I
> want to use PySpark in a notebook with MASTER="local[4]"? It would be much
> nicer to just type `from pyspark import SparkContext; sc =
> SparkContext("local[4]")` in my notebook.
> I did a test and it seems like PySpark's basic unit tests do pass when
> SPARK_HOME is set and Py4J is on the PYTHONPATH:
> python $SPARK_HOME/python/pyspark/
> -Jey
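The "shim" Jey describes could look roughly like the following: a pip-installable top-level module that defers to an existing Spark installation found via SPARK_HOME. All names here are hypothetical sketches, not an actual published package:

```python
# Sketch of a pip-installable "pyspark" shim: instead of bundling Spark, the
# package locates an existing installation via SPARK_HOME and puts its Python
# sources on sys.path. Function names are hypothetical.
import glob
import os
import sys


def find_spark_home():
    """Return SPARK_HOME, or raise if no installation can be found."""
    spark_home = os.environ.get("SPARK_HOME")
    if not spark_home or not os.path.isdir(spark_home):
        raise RuntimeError(
            "SPARK_HOME is not set; point it at an existing Spark installation"
        )
    return spark_home


def add_pyspark_to_path():
    """Make `import pyspark` resolve against the installation in SPARK_HOME."""
    python_dir = os.path.join(find_spark_home(), "python")
    # Py4J is zipped inside the distribution under python/lib/.
    entries = [python_dir] + glob.glob(
        os.path.join(python_dir, "lib", "py4j-*.zip")
    )
    for entry in entries:
        if entry not in sys.path:
            sys.path.insert(0, entry)
```

A shim like this would make `from pyspark import SparkContext; sc = SparkContext("local[4]")` work in a notebook without any launcher incantation, provided SPARK_HOME points at a valid distribution; Jey's "download Spark if SPARK_HOME is unset" variant would extend find_spark_home.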
> On Fri, Jun 5, 2015 at 10:57 AM, Josh Rosen <> wrote:
>> This has been proposed before:
>> There's currently tighter coupling between the Python and Java halves of
>> PySpark than just requiring SPARK_HOME to be set; if we did this, I bet
>> we'd run into tons of issues when users try to run a newer version of the
>> Python half of PySpark against an older set of Java components or
>> vice-versa.
>> On Thu, Jun 4, 2015 at 10:45 PM, Olivier Girardot <
>>> wrote:
>>> Hi everyone,
>>> Considering that the Python API is just a front-end that needs SPARK_HOME
>>> defined anyway, I think it would be interesting to deploy the Python part
>>> of Spark on PyPi in order to handle the dependencies in a Python project
>>> needing PySpark via pip.
>>> For now I just symlink python/pyspark into my Python install's
>>> site-packages/ directory so that PyCharm and other lint tools work properly.
>>> I can do the work if needed.
>>> What do you think ?
>>> Regards,
>>> Olivier.
