spark-dev mailing list archives

From Justin Uang <justin.u...@gmail.com>
Subject Re: PySpark on PyPi
Date Thu, 20 Aug 2015 21:44:21 GMT
I would prefer to just do it without the jar first as well. My hunch is
that to run spark the way it is intended, we need the wrapper scripts, like
spark-submit. Does anyone know authoritatively if that is the case?

On Thu, Aug 20, 2015 at 4:54 PM Olivier Girardot <
o.girardot@lateral-thoughts.com> wrote:

> +1
> But just to improve the error logging,
> would it be possible to add some warn logging in pyspark when the
> SPARK_HOME env variable is pointing to a Spark distribution with a
> different version from the pyspark package?
>
> Regards,
>
> Olivier.
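
As a rough illustration of the check Olivier is asking for, a sketch along these lines could compare the installed package's version against the distribution under SPARK_HOME (the RELEASE-file parsing, function name, and logger name are assumptions, not existing pyspark behaviour):

    # Hypothetical sketch: warn when the pip-installed pyspark's version differs
    # from the Spark distribution that SPARK_HOME points at. Binary distributions
    # ship a RELEASE file whose first line looks like "Spark 1.5.0 built for ...".
    import logging
    import os
    import re

    log = logging.getLogger("pyspark")

    def warn_on_version_mismatch(package_version):
        spark_home = os.environ.get("SPARK_HOME")
        if not spark_home:
            return
        release = os.path.join(spark_home, "RELEASE")
        if not os.path.isfile(release):
            return
        with open(release) as f:
            match = re.search(r"Spark (\S+)", f.read())
        if match and match.group(1) != package_version:
            log.warning("pyspark %s does not match the Spark distribution %s at "
                        "SPARK_HOME=%s", package_version, match.group(1), spark_home)

If, as Justin argues further down the thread, the versions must match exactly, the same check could raise an exception instead of logging a warning.
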
>
> 2015-08-20 22:43 GMT+02:00 Brian Granger <ellisonbg@gmail.com>:
>
>> I would start with just the plain python package without the JAR and
>> then see if it makes sense to add the JAR over time.
>>
>> On Thu, Aug 20, 2015 at 12:27 PM, Auberon Lopez <auberon.lopez@gmail.com>
>> wrote:
>> > Hi all,
>> >
>> > I wanted to bubble up a conversation from the PR to this discussion to
>> > see if there is support for the idea of including a Spark assembly JAR
>> > in a PyPI release of pyspark. @holdenk recommended this as she already
>> > does so in the Sparkling Pandas package. Is this something people are
>> > interested in pursuing?
>> >
>> > -Auberon
>> >
>> > On Thu, Aug 20, 2015 at 10:03 AM, Brian Granger <ellisonbg@gmail.com>
>> > wrote:
>> >>
>> >> Auberon, can you also post this to the Jupyter Google Group?
>> >>
>> >> On Wed, Aug 19, 2015 at 12:23 PM, Auberon Lopez <auberon.lopez@gmail.com>
>> >> wrote:
>> >> > Hi all,
>> >> >
>> >> > I've created an updated PR for this based off of the previous work
>> >> > of @prabinb:
>> >> > https://github.com/apache/spark/pull/8318
>> >> >
>> >> > I am not very familiar with python packaging; feedback is
>> >> > appreciated.
>> >> >
>> >> > -Auberon
>> >> >
>> >> > On Mon, Aug 10, 2015 at 12:45 PM, MinRK <benjaminrk@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >>
>> >> >> On Mon, Aug 10, 2015 at 12:28 PM, Matt Goodman <meawoppl@gmail.com>
>> >> >> wrote:
>> >> >>>
>> >> >>> I would tentatively suggest also conda packaging.
>> >> >>
>> >> >>
>> >> >> A conda package has the advantage that it can be set up without
>> >> >> 'installing' the pyspark files, while the PyPI packaging is still
>> >> >> being worked out. It can just add a pyspark.pth file pointing to
>> >> >> the pyspark, py4j locations. But I think it's a really good idea to
>> >> >> package with conda.
>> >> >>
>> >> >> -MinRK
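
For reference, a minimal sketch of the pyspark.pth idea MinRK describes, assuming SPARK_HOME is set and the py4j zip name from the 1.4/1.5 distributions (a conda recipe would do something equivalent at install time):

    # Rough sketch: write a .pth file into site-packages so that Python picks up
    # pyspark and py4j from an existing SPARK_HOME at interpreter startup,
    # without copying any files.
    import os
    import site

    spark_home = os.environ["SPARK_HOME"]  # assumed to point at a Spark distribution
    entries = [
        os.path.join(spark_home, "python"),
        os.path.join(spark_home, "python", "lib", "py4j-0.8.2.1-src.zip"),
    ]
    target = os.path.join(site.getsitepackages()[0], "pyspark.pth")
    with open(target, "w") as f:
        f.write("\n".join(entries) + "\n")
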
>> >> >>
>> >> >>>
>> >> >>>
>> >> >>> http://conda.pydata.org/docs/
>> >> >>>
>> >> >>> --Matthew Goodman
>> >> >>>
>> >> >>> =====================
>> >> >>> Check Out My Website: http://craneium.net
>> >> >>> Find me on LinkedIn: http://tinyurl.com/d6wlch
>> >> >>>
>> >> >>> On Mon, Aug 10, 2015 at 11:23 AM, Davies Liu <davies@databricks.com>
>> >> >>> wrote:
>> >> >>>>
>> >> >>>> I think so, any contributions on this are welcome.
>> >> >>>>
>> >> >>>> On Mon, Aug 10, 2015 at 11:03 AM, Brian Granger <ellisonbg@gmail.com>
>> >> >>>> wrote:
>> >> >>>> > Sorry, trying to follow the context here. Does it look like there
>> >> >>>> > is support for the idea of creating a setup.py file and pypi
>> >> >>>> > package for pyspark?
>> >> >>>> >
>> >> >>>> > Cheers,
>> >> >>>> >
>> >> >>>> > Brian
>> >> >>>> >
>> >> >>>> > On Thu, Aug 6, 2015 at 3:14 PM, Davies Liu <davies@databricks.com>
>> >> >>>> > wrote:
>> >> >>>> >> We could do that after 1.5 is released; it will have the same
>> >> >>>> >> release cycle as Spark in the future.
>> >> >>>> >>
>> >> >>>> >> On Tue, Jul 28, 2015 at 5:52 AM, Olivier Girardot
>> >> >>>> >> <o.girardot@lateral-thoughts.com> wrote:
>> >> >>>> >>> +1 (once again :) )
>> >> >>>> >>>
>> >> >>>> >>> 2015-07-28 14:51 GMT+02:00 Justin Uang
>> >> >>>> >>> <justin.uang@gmail.com>:
>> >> >>>> >>>>
>> >> >>>> >>>> // ping
>> >> >>>> >>>>
>> >> >>>> >>>> do we have any signoff from the pyspark devs to submit a PR to
>> >> >>>> >>>> publish to PyPI?
>> >> >>>> >>>>
>> >> >>>> >>>> On Fri, Jul 24, 2015 at 10:50 PM Jeremy Freeman
>> >> >>>> >>>> <freeman.jeremy@gmail.com> wrote:
>> >> >>>> >>>>>
>> >> >>>> >>>>> Hey all, great discussion, just wanted to +1 that I see a lot
>> >> >>>> >>>>> of value in steps that make it easier to use PySpark as an
>> >> >>>> >>>>> ordinary python library.
>> >> >>>> >>>>>
>> >> >>>> >>>>> You might want to check out this
>> >> >>>> >>>>> (https://github.com/minrk/findspark), started by Jupyter
>> >> >>>> >>>>> project devs, that offers one way to facilitate this stuff.
>> >> >>>> >>>>> I’ve also cced them here to join the conversation.
>> >> >>>> >>>>>
>> >> >>>> >>>>> Also, @Jey, I can also confirm that at least in some scenarios
>> >> >>>> >>>>> (I’ve done it in an EC2 cluster in standalone mode) it’s
>> >> >>>> >>>>> possible to run PySpark jobs just using `from pyspark import
>> >> >>>> >>>>> SparkContext; sc = SparkContext(master=“X”)` so long as the
>> >> >>>> >>>>> environmental variables (PYTHONPATH and PYSPARK_PYTHON) are
>> >> >>>> >>>>> set correctly on *both* workers and driver. That said, there’s
>> >> >>>> >>>>> definitely additional configuration / functionality that would
>> >> >>>> >>>>> require going through the proper submit scripts.
>> >> >>>> >>>>>
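
A minimal sketch of the pattern Jeremy describes, using the findspark package mentioned above to do the sys.path setup; the SPARK_HOME fallback path, app name, and master URL are placeholders:

    # Illustrative sketch: let findspark put the SPARK_HOME-provided pyspark and
    # py4j sources on sys.path, then create a SparkContext directly rather than
    # going through the bin/pyspark wrapper.
    import os

    import findspark

    # If SPARK_HOME is not exported, point findspark at a distribution explicitly.
    findspark.init(os.environ.get("SPARK_HOME", "/opt/spark"))

    from pyspark import SparkContext

    sc = SparkContext(master="local[4]", appName="findspark-demo")
    print(sc.parallelize(range(100)).sum())
    sc.stop()
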
>> >> >>>> >>>>> On Jul 22, 2015, at 7:41 PM, Punyashloka Biswal
>> >> >>>> >>>>> <punya.biswal@gmail.com> wrote:
>> >> >>>> >>>>>
>> >> >>>> >>>>> I agree with everything Justin just said. An additional
>> >> >>>> >>>>> advantage of publishing PySpark's Python code in a
>> >> >>>> >>>>> standards-compliant way is the fact that we'll be able to
>> >> >>>> >>>>> declare transitive dependencies (Pandas, Py4J) in a way that
>> >> >>>> >>>>> pip can use. Contrast this with the current situation, where
>> >> >>>> >>>>> df.toPandas() exists in the Spark API but doesn't actually
>> >> >>>> >>>>> work until you install Pandas.
>> >> >>>> >>>>>
>> >> >>>> >>>>> Punya
>> >> >>>> >>>>> On Wed, Jul 22, 2015 at 12:49 PM Justin Uang
>> >> >>>> >>>>> <justin.uang@gmail.com> wrote:
>> >> >>>> >>>>>>
>> >> >>>> >>>>>> // + Davies for his comments
>> >> >>>> >>>>>> // + Punya for SA
>> >> >>>> >>>>>>
>> >> >>>> >>>>>> For development and CI, like Olivier mentioned, I think it
>> >> >>>> >>>>>> would be hugely beneficial to publish pyspark (only code in
>> >> >>>> >>>>>> the python/ dir) on PyPI. If anyone wants to develop against
>> >> >>>> >>>>>> PySpark APIs, they need to download the distribution and do
>> >> >>>> >>>>>> a lot of PYTHONPATH munging for all the tools (pylint,
>> >> >>>> >>>>>> pytest, IDE code completion). Right now that involves adding
>> >> >>>> >>>>>> python/ and python/lib/py4j-0.8.2.1-src.zip. In case pyspark
>> >> >>>> >>>>>> ever wants to add more dependencies, we would have to
>> >> >>>> >>>>>> manually mirror all the PYTHONPATH munging in the ./pyspark
>> >> >>>> >>>>>> script. With a proper pyspark setup.py which declares its
>> >> >>>> >>>>>> dependencies, and a published distribution, depending on
>> >> >>>> >>>>>> pyspark will just be adding pyspark to my setup.py
>> >> >>>> >>>>>> dependencies.
>> >> >>>> >>>>>>
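
Purely for illustration, a minimal setup.py of the sort Justin is describing, assuming the package ships the contents of python/ and pins the bundled Py4J version (names and versions here are guesses, not the eventual official packaging):

    # Illustrative sketch, not the actual Spark build: package python/pyspark and
    # declare the Py4J dependency so pip can resolve it.
    from setuptools import setup, find_packages

    setup(
        name="pyspark",
        version="1.5.0",                      # would track the Spark release
        description="Python bindings for Apache Spark",
        packages=find_packages(where="python"),
        package_dir={"": "python"},
        install_requires=["py4j==0.8.2.1"],   # version bundled with Spark 1.4/1.5
    )

With something like this published, depending on pyspark would be a one-line entry in a downstream project's own install_requires.
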
>> >> >>>> >>>>>> Of course, if we actually want to run parts of pyspark that
>> >> >>>> >>>>>> are backed by Py4J calls, then we need the full spark
>> >> >>>> >>>>>> distribution with either ./pyspark or ./spark-submit, but
>> >> >>>> >>>>>> for things like linting and development, the PYTHONPATH
>> >> >>>> >>>>>> munging is very annoying.
>> >> >>>> >>>>>>
>> >> >>>> >>>>>> I don't think the version-mismatch issues are a compelling
>> >> >>>> >>>>>> reason to not go ahead with PyPI publishing. At runtime, we
>> >> >>>> >>>>>> should definitely enforce that the version has to be exact,
>> >> >>>> >>>>>> which means there is no backcompat nightmare as suggested by
>> >> >>>> >>>>>> Davies in https://issues.apache.org/jira/browse/SPARK-1267.
>> >> >>>> >>>>>> This would mean that even if the user got his pip installed
>> >> >>>> >>>>>> pyspark to somehow get loaded before the spark distribution
>> >> >>>> >>>>>> provided pyspark, then the user would be alerted immediately.
>> >> >>>> >>>>>>
>> >> >>>> >>>>>> Davies, if you buy this, should I or someone on my team pick
>> >> >>>> >>>>>> up https://issues.apache.org/jira/browse/SPARK-1267 and
>> >> >>>> >>>>>> https://github.com/apache/spark/pull/464?
>> >> >>>> >>>>>>
>> >> >>>> >>>>>> On Sat, Jun 6, 2015 at 12:48 AM Olivier Girardot
>> >> >>>> >>>>>> <o.girardot@lateral-thoughts.com> wrote:
>> >> >>>> >>>>>>>
>> >> >>>> >>>>>>> Ok, I get it. Now what can we do to improve the current
>> >> >>>> >>>>>>> situation? Because right now, if I want to set up a CI env
>> >> >>>> >>>>>>> for PySpark, I have to:
>> >> >>>> >>>>>>> 1- download a pre-built version of pyspark and unzip it
>> >> >>>> >>>>>>> somewhere on every agent
>> >> >>>> >>>>>>> 2- define the SPARK_HOME env
>> >> >>>> >>>>>>> 3- symlink this distribution's pyspark dir inside the python
>> >> >>>> >>>>>>> install's site-packages/ directory
>> >> >>>> >>>>>>> and if I rely on additional packages (like databricks'
>> >> >>>> >>>>>>> Spark-CSV project), I have to (unless I'm mistaken):
>> >> >>>> >>>>>>> 4- compile/assemble spark-csv and deploy the jar in a
>> >> >>>> >>>>>>> specific directory on every agent
>> >> >>>> >>>>>>> 5- add this jar-filled directory to the Spark distribution's
>> >> >>>> >>>>>>> additional classpath using the conf/spark-defaults.conf file
>> >> >>>> >>>>>>>
>> >> >>>> >>>>>>> Then finally we can launch our unit/integration-tests.
>> >> >>>> >>>>>>> Some issues are related to spark-packages, some to the lack
>> >> >>>> >>>>>>> of a python-based dependency, and some to the way
>> >> >>>> >>>>>>> SparkContexts are launched when using pyspark.
>> >> >>>> >>>>>>> I think steps 1 and 2 are fair enough.
>> >> >>>> >>>>>>> Steps 4 and 5 may already have solutions (I didn't check),
>> >> >>>> >>>>>>> and considering spark-shell downloads such dependencies
>> >> >>>> >>>>>>> automatically, I think if nothing's done yet it will be
>> >> >>>> >>>>>>> (I guess?).
>> >> >>>> >>>>>>>
>> >> >>>> >>>>>>> For step 3, maybe just adding a setup.py to the
>> >> >>>> >>>>>>> distribution would be enough. I'm not exactly advocating
>> >> >>>> >>>>>>> distributing a full 300MB Spark distribution on PyPI; maybe
>> >> >>>> >>>>>>> there's a better compromise?
>> >> >>>> >>>>>>>
>> >> >>>> >>>>>>> Regards,
>> >> >>>> >>>>>>>
>> >> >>>> >>>>>>> Olivier.
>> >> >>>> >>>>>>>
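
On steps 4 and 5, the spark-packages mechanism Olivier alludes to can resolve such jars automatically; here is a sketch of driving it from plain Python (the spark-csv coordinates and version are only an example):

    # Sketch: let Spark resolve spark-csv itself instead of deploying jars to
    # every agent. PYSPARK_SUBMIT_ARGS is read by pyspark when it launches the
    # JVM; the trailing "pyspark-shell" token is required on Spark 1.4+.
    import os

    os.environ["PYSPARK_SUBMIT_ARGS"] = (
        "--packages com.databricks:spark-csv_2.10:1.2.0 pyspark-shell"
    )

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext("local[2]", appName="spark-csv-demo")
    sqlContext = SQLContext(sc)
    df = sqlContext.read.format("com.databricks.spark.csv") \
        .option("header", "true").load("example.csv")  # placeholder path
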
>> >> >>>> >>>>>>> On Fri, Jun 5, 2015 at 22:12, Jey Kottalam
>> >> >>>> >>>>>>> <jey@cs.berkeley.edu> wrote:
>> >> >>>> >>>>>>>>
>> >> >>>> >>>>>>>> Couldn't we have a pip installable "pyspark" package that
>> >> >>>> >>>>>>>> just serves as a shim to an existing Spark installation?
>> >> >>>> >>>>>>>> Or it could even download the latest Spark binary if
>> >> >>>> >>>>>>>> SPARK_HOME isn't set during installation. Right now, Spark
>> >> >>>> >>>>>>>> doesn't play very well with the usual Python ecosystem.
>> >> >>>> >>>>>>>> For example, why do I need to use a strange incantation
>> >> >>>> >>>>>>>> when booting up IPython if I want to use PySpark in a
>> >> >>>> >>>>>>>> notebook with MASTER="local[4]"? It would be much nicer to
>> >> >>>> >>>>>>>> just type `from pyspark import SparkContext; sc =
>> >> >>>> >>>>>>>> SparkContext("local[4]")` in my notebook.
>> >> >>>> >>>>>>>>
>> >> >>>> >>>>>>>> I did a test and it seems like PySpark's basic unit-tests
>> >> >>>> >>>>>>>> do pass when SPARK_HOME is set and Py4J is on the
>> >> >>>> >>>>>>>> PYTHONPATH:
>> >> >>>> >>>>>>>>
>> >> >>>> >>>>>>>> PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH \
>> >> >>>> >>>>>>>>   python $SPARK_HOME/python/pyspark/rdd.py
>> >> >>>> >>>>>>>>
>> >> >>>> >>>>>>>> -Jey
>> >> >>>> >>>>>>>>
>> >> >>>> >>>>>>>>
>> >> >>>> >>>>>>>> On Fri, Jun 5, 2015 at 10:57 AM, Josh Rosen
>> >> >>>> >>>>>>>> <rosenville@gmail.com> wrote:
>> >> >>>> >>>>>>>>>
>> >> >>>> >>>>>>>>> This has been proposed before:
>> >> >>>> >>>>>>>>> https://issues.apache.org/jira/browse/SPARK-1267
>> >> >>>> >>>>>>>>>
>> >> >>>> >>>>>>>>> There's currently tighter coupling between the Python and
>> >> >>>> >>>>>>>>> Java halves of PySpark than just requiring SPARK_HOME to
>> >> >>>> >>>>>>>>> be set; if we did this, I bet we'd run into tons of
>> >> >>>> >>>>>>>>> issues when users try to run a newer version of the
>> >> >>>> >>>>>>>>> Python half of PySpark against an older set of Java
>> >> >>>> >>>>>>>>> components or vice-versa.
>> >> >>>> >>>>>>>>>
>> >> >>>> >>>>>>>>> On Thu, Jun 4, 2015 at 10:45 PM, Olivier Girardot
>> >> >>>> >>>>>>>>> <o.girardot@lateral-thoughts.com> wrote:
>> >> >>>> >>>>>>>>>>
>> >> >>>> >>>>>>>>>> Hi everyone,
>> >> >>>> >>>>>>>>>> Considering the python API as just a front needing the
>> >> >>>> >>>>>>>>>> SPARK_HOME defined anyway, I think it would be
>> >> >>>> >>>>>>>>>> interesting to deploy the Python part of Spark on PyPI
>> >> >>>> >>>>>>>>>> in order to handle the dependencies in a Python project
>> >> >>>> >>>>>>>>>> needing PySpark via pip.
>> >> >>>> >>>>>>>>>>
>> >> >>>> >>>>>>>>>> For now I just symlink the python/pyspark dir into my
>> >> >>>> >>>>>>>>>> python install's site-packages/ in order for PyCharm or
>> >> >>>> >>>>>>>>>> other lint tools to work properly. I can do the setup.py
>> >> >>>> >>>>>>>>>> work or anything.
>> >> >>>> >>>>>>>>>>
>> >> >>>> >>>>>>>>>> What do you think?
>> >> >>>> >>>>>>>>>>
>> >> >>>> >>>>>>>>>> Regards,
>> >> >>>> >>>>>>>>>>
>> >> >>>> >>>>>>>>>> Olivier.
>> >> >>>> >>>>>>>>>
>> >> >>>> >>>>>>>>>
>> >> >>>> >>>>>>>>
>> >> >>>> >>>>>
>> >> >>>> >>>
>> >> >>>> >
>> >> >>>> >
>> >> >>>> >
>> >> >>>> > --
>> >> >>>> > Brian E. Granger
>> >> >>>> > Cal Poly State University, San Luis Obispo
>> >> >>>> > @ellisonbg on Twitter and GitHub
>> >> >>>> > bgranger@calpoly.edu and ellisonbg@gmail.com
>> >> >>>>
>> >> >>>>
>> >> >>>> ---------------------------------------------------------------------
>> >> >>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> >> >>>> For additional commands, e-mail: dev-help@spark.apache.org
>> >> >>>>
>> >> >>>
>> >> >>
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Brian E. Granger
>> >> Associate Professor of Physics and Data Science
>> >> Cal Poly State University, San Luis Obispo
>> >> @ellisonbg on Twitter and GitHub
>> >> bgranger@calpoly.edu and ellisonbg@gmail.com
>> >
>> >
>>
>>
>>
>> --
>> Brian E. Granger
>> Associate Professor of Physics and Data Science
>> Cal Poly State University, San Luis Obispo
>> @ellisonbg on Twitter and GitHub
>> bgranger@calpoly.edu and ellisonbg@gmail.com
>>
>
>
>
> --
> *Olivier Girardot* | Partner
> o.girardot@lateral-thoughts.com
> +33 6 24 09 17 94
>
