spark-user mailing list archives

From Nicholas Chammas <nicholas.cham...@gmail.com>
Subject Re: Python Spark Improvements (forked from Spark Improvement Proposals)
Date Thu, 13 Oct 2016 01:59:50 GMT
I'd add one item to this list: The lack of Python 3 support in Spark
Packages <https://github.com/databricks/sbt-spark-package/issues/26>. This
means that great packages like GraphFrames cannot be used with Python 3
<https://github.com/graphframes/graphframes/issues/85>.

This is quite disappointing since Spark itself supports Python 3 and since
-- at least in my circles -- Python 3 adoption is reaching a tipping point.
All new Python projects at my company and at my friends' companies are
being written in Python 3.

Nick


On Wed, Oct 12, 2016 at 3:52 PM Holden Karau <holden@pigscanfly.ca> wrote:

> Hi Spark Devs & Users,
>
>
> Forking off from Cody’s original thread
> <http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-Improvement-Proposals-td19268.html#none>
> of Spark Improvements, and Matei's follow-up asking what issues the
> Python community was facing with Spark, I think it would be useful for us
> to discuss some of the motivations behind parts of the Python community
> looking at different technologies to replace Apache Spark. My viewpoint is
> that of a developer who works on Apache Spark
> day-to-day <http://bit.ly/hkspmg>, but who also gives a fair number of talks
> at Python conferences
> <https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=holden+karau+pydata&start=0>,
> and I feel many (but not all) of the same challenges the Python
> community does when trying to use Spark. I’ve included both the user@ and dev@
> lists on this one since I think the user community can probably provide
> more reasons why they have difficulty with PySpark. I should also point out
> - the solution for all of these things may not live inside of the Spark
> project itself, but it still impacts our usability as a whole.
>
>
>    - Lack of pip installability
>
> This is one of the points that Matei mentioned, and it is something several
> people have tried to provide for Spark in one way or another. It seems
> getting reviewer time for this issue is rather challenging, and I’ve been
> hesitant to ask the contributors to keep updating their PRs (as much as I
> want to see some progress) because I just don't know if we have the time or
> interest in this. I’m happy to pick up the latest from Juliet and try and
> carry it over the finish line if we can find some committer time to work on
> this since it now sounds like there is consensus we should do this.
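>
> To make concrete what pip installability would buy us, here is a rough
> sketch of the workflow it would enable (hypothetical today, since Spark is
> not yet published to PyPI):
>
>     # Hypothetical workflow once Spark is on PyPI:
>     #   pip install pyspark
>     # after which PySpark is importable from any plain Python interpreter:
>     from pyspark import SparkContext
>
>     sc = SparkContext("local[*]", "pip-installed-example")
>     print(sc.parallelize(range(10)).sum())
>     sc.stop()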
>
>    - Difficulty using PySpark from outside of spark-submit / pyspark shell
>
> The FindSpark <https://pypi.python.org/pypi/findspark> package needing to
> exist is one of the clearest examples of this challenge. There is also a PR
> to make it easier for other shells to extend the Spark shell, and we ran
> into some similar challenges while working on Sparkling Pandas. This could
> be solved by making Spark pip installable, so I won’t say too much about
> this point.
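>
> For anyone who hasn't hit this, the workaround today looks roughly like the
> following (the Spark home path is just a placeholder):
>
>     # findspark locates a Spark installation and patches sys.path so that
>     # PySpark can be imported from a regular Python/IPython session.
>     import findspark
>     findspark.init("/path/to/spark")  # or omit the path and rely on SPARK_HOME
>
>     import pyspark
>     sc = pyspark.SparkContext(appName="outside-spark-submit")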
>
>    - Minimal integration with IPython/IJupyter
>
> This one is awkward since one of the areas that some of the larger
> commercial players work in effectively “competes” (in a very loose sense)
> with any features introduced around here. I’m not really super sure what
> the best path forward is here, but I think collaborating with the IJupyter
> people to enable more features found in the commercial offerings in open
> source could be beneficial to everyone in the community, and maybe even
> reduce the maintenance cost for some of the commercial entities. I
> understand this is a tricky issue but having good progress indicators or
> something similar could make a huge difference. (Note that Apache Toree
> <https://toree.incubator.apache.org/> [Incubating] exists for Scala users
> but hopefully the PySpark IJupyter integration could be achieved without a
> new kernel).
>
>    - Lack of virtualenv and/or Python package distribution support
>
> This one is also tricky since many commercial providers have their own
> “solution” to this, but there isn’t a good story around supporting custom
> virtual envs or user-required Python packages. While spark-packages _can_
> be Python, this requires that the Python package developer go through rather
> a lot of work to make their package available and realistically won’t
> happen for most Python packages people want to use. And to be fair, the
> addFiles mechanism does support Python eggs which works for some packages.
> There are some outstanding PRs around this issue (and I understand these
> are perhaps large issues which might require large changes to the current
> suggested implementations - I’ve had difficulty keeping the current set of
> open PRs around this straight in my own head) but there seems to be no
> committer bandwidth or interest in working with the contributors who have
> suggested these things. Is this an intentional decision or is this
> something we as a community are willing to work on/tackle?
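>
> For context, the mechanism mentioned above looks roughly like the sketch
> below on the Python side; the egg path and the my_dependency package name
> are placeholders:
>
>     from pyspark import SparkContext
>
>     sc = SparkContext(appName="egg-shipping-example")
>     # Ship a pure-Python dependency to the executors alongside the job.
>     sc.addPyFile("/path/to/my_dependency-0.1-py2.7.egg")
>
>     def uses_dependency(x):
>         # The shipped package becomes importable inside tasks
>         # (my_dependency is a placeholder name).
>         import my_dependency
>         return my_dependency.transform(x)
>
>     result = sc.parallelize(range(4)).map(uses_dependency).collect()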
>
>    - Speed/performance
>
> This is often a complaint I hear from more “data engineering” profile
> users who are working in Python. These problems come mostly in places
> involving the interaction of Python and the JVM (so UDFs, transformations
> with arbitrary lambdas, collect() and toPandas()). This is an area I’m
> working on (see https://github.com/apache/spark/pull/13571 ) and
> hopefully we can start investigating Apache Arrow
> <https://arrow.apache.org/> to speed up the bridge (or something similar)
> once it’s a bit more ready (currently Arrow just released 0.1 which is
> exciting). We also probably need to start measuring these things more
> closely since otherwise random regressions will continue to be introduced
> (like the challenge with unbalanced partitions and block serialization
> together - see SPARK-17817
> <https://issues.apache.org/jira/browse/SPARK-17817>, which fixed this).
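>
> To make the boundary cost concrete, here is a small sketch using only the
> standard DataFrame API (the row count is arbitrary):
>
>     from pyspark.sql import SparkSession
>     from pyspark.sql.functions import udf, col
>     from pyspark.sql.types import LongType
>
>     spark = SparkSession.builder.appName("udf-vs-builtin").getOrCreate()
>     df = spark.range(1000000)
>
>     # Each row of the Python UDF version round-trips through Py4J/pickle,
>     # while the built-in expression is evaluated entirely on the JVM.
>     plus_one = udf(lambda x: x + 1, LongType())
>     slow = df.select(plus_one(col("id"))).count()
>     fast = df.select(col("id") + 1).count()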
>
>    - Configuration difficulties (especially related to OOMs)
>
> This is a general challenge many people face working in Spark, but PySpark
> users are also asked to somehow figure out what the correct amount of
> memory is to give to the Python process versus the Scala/JVM processes.
> This was maybe an acceptable solution at the start, but when combined with
> the difficult-to-understand error messages it can become quite the time
> sink. A quick workaround would be picking a different default overhead for
> applications using Python, but more generally hopefully some shared off-JVM
> heap solution could also help reduce this challenge in the future.
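>
> As a concrete (if unsatisfying) example of the kind of tuning users end up
> doing today, a sketch follows; the numbers are purely illustrative:
>
>     from pyspark import SparkConf, SparkContext
>
>     conf = (SparkConf()
>             .setAppName("python-memory-tuning-example")
>             # Room for the Python worker processes outside the JVM heap
>             # (YARN deployments; value in MB).
>             .set("spark.yarn.executor.memoryOverhead", "1024")
>             # Memory each Python worker may use before spilling during
>             # aggregation.
>             .set("spark.python.worker.memory", "512m"))
>     sc = SparkContext(conf=conf)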
>
>    - API difficulties
>
> That the Spark API doesn’t “feel” very Pythonic is a complaint some people have,
> but I think we’ve done some excellent work in the DataFrame/Dataset API
> here. At the same time we’ve made some really frustrating choices with the
> DataFrame API (e.g. removing map from DataFrames pre-emptively even when we
> have no concrete plans to bring the Dataset API to PySpark).
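>
> For anyone who hits the missing map, one workaround is to drop down to the
> underlying RDD (and lose the Catalyst optimizer along the way); a minimal
> sketch:
>
>     from pyspark.sql import SparkSession
>
>     spark = SparkSession.builder.appName("df-map-workaround").getOrCreate()
>     df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
>
>     # With DataFrame.map gone in PySpark, row-wise transformations go
>     # through the RDD API and back.
>     doubled = df.rdd \
>         .map(lambda row: (row.id * 2, row.letter)) \
>         .toDF(["id", "letter"])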
>
> A lot of users wish that our DataFrame API was more like the Pandas API
> (and Wes has pointed out on some JIRAs where we have differences) as well
> as covered more of the functionality of Pandas. This is a hard problem, and
> the solution might not belong inside of PySpark itself (Juliet and I did
> some proof-of-concept work back in the day on Sparkling Pandas
> <https://github.com/sparklingpandas/sparklingpandas>) - but since one of
> my personal goals has been trying to become a committer I’ve been more
> focused on contributing to Spark itself rather than libraries and very few
> people seem to be interested in working on this project [although I still
> have potential users ask if they can use it]. (Of course if there is
> sufficient interest to reboot Sparkling Pandas or something similar that
> would be an interesting area of work - but it’s also a huge area of work -
> if you look at Dask <http://dask.pydata.org/>, a good portion of the work
> is dedicated just to supporting pandas-like operations).
>
>    - Incomprehensible error messages
>
> I often have people ask me how to debug PySpark and they often have a
> certain haunted look in their eyes while they ask me this (slightly
> joking). More seriously, we really need to provide more guidance around how
> to understand PySpark error messages and look at figuring out if there are
> places where we can improve the messaging so users aren’t hunting through
> Stack Overflow trying to figure out how the Java exception they are
> getting relates to their Python code. In one talk I gave recently
> someone mentioned PySpark was the motivation behind finding the hide error
> messages plugin/settings for IJupyter.
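>
> For a small taste of the problem, the sketch below fails on the Python side
> but surfaces to the user wrapped in a JVM stack trace:
>
>     from pyspark import SparkContext
>
>     sc = SparkContext(appName="error-message-example")
>     try:
>         # int("two") raises a Python ValueError inside the worker...
>         sc.parallelize(["1", "two", "3"]).map(int).collect()
>     except Exception as e:
>         # ...but what the user sees is typically a Py4JJavaError wrapping
>         # pages of JVM frames, with the real Python traceback buried inside.
>         print(type(e).__name__)
>         print(str(e)[:500])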
>
>    - Lack of useful ML model & pipeline export/import
>
> This is something we’ve made great progress on: many of the PySpark models
> are now able to use the underlying export mechanisms from Java. However, I
> often hear about challenges with using these models in the rest of the Python
> space once they have been exported from Spark. I’ve got a PR to add basic
> PMML export in Scala to ML (which we can then bring to Python), but I think
> the Python community is open to other formats if the Spark community
> doesn’t want to go the PMML route.
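>
> What already works is the Spark-internal round trip, piggybacking on the
> Java writers/readers; the gap is consuming the result outside the JVM. A
> minimal sketch (paths are placeholders):
>
>     from pyspark.ml import Pipeline, PipelineModel
>     from pyspark.ml.feature import VectorAssembler
>     from pyspark.ml.regression import LinearRegression
>     from pyspark.sql import SparkSession
>
>     spark = SparkSession.builder.appName("ml-export-example").getOrCreate()
>     train = spark.createDataFrame([(1.0, 2.0), (2.0, 4.1), (3.0, 6.2)], ["x", "y"])
>
>     pipeline = Pipeline(stages=[
>         VectorAssembler(inputCols=["x"], outputCol="features"),
>         LinearRegression(featuresCol="features", labelCol="y"),
>     ])
>     model = pipeline.fit(train)
>     model.save("/tmp/example-lr-pipeline")             # Spark's own format
>     reloaded = PipelineModel.load("/tmp/example-lr-pipeline")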
>
> Now I don’t think we will see the same challenges we’ve seen develop in
> the R community, but I suspect purely Python approaches to distributed
> systems will continue to eat the “low end” of Spark (e.g. medium-sized data
> problems requiring parallelism). This isn’t necessarily a bad thing, but if
> there is anything I’ve learnt it’s that the "low end" solution often
> quickly eats the "high end" within a few years - and I’d rather see Spark
> continue to thrive outside of the pure JVM space.
>
> These are just the biggest issues that I hear come up commonly and that I
> remembered on my flight back - it’s quite possible I’ve missed important
> things. I know contributing on a mailing list can be scary or intimidating
> for new users (and even experienced developers may wish to stay out of
> discussions they view as heated) - but I strongly encourage everyone to
> participate (respectfully) in this thread and we can all work together to
> help Spark continue to be the place where people from different languages
> and backgrounds continue to come together to collaborate.
>
>
> I want to be clear as well: while I feel these are all very important
> issues (and being someone who has worked on PySpark & Spark for years
> <http://bit.ly/hkspmg> without being a committer I may sometimes come off
> as frustrated when I talk about these) I think PySpark as a whole is a
> really excellent application and we do some really awesome stuff with it.
> There are also things that I will be blind to as a result of having worked
> on Spark for so long (for example yesterday I caught myself using the _
> syntax in a Scala example without explaining it because it seems “normal”
> to me but often trips up newcomers.) If we can address even some of these
> issues I believe it will be a huge win for Spark adoption outside of the
> traditional JVM space (and as the various community surveys continue to
> indicate PySpark usage is already quite high).
>
> Normally I’d bring in a box of timbits
> <https://en.wikipedia.org/wiki/Timbits>/doughnuts or something if we were
> having an in-person meeting for this but all I can do for the mailing list
> is attach a cute picture and offer future doughnuts/coffee if people want
> to chat IRL. So, in closing, I’ve included a picture of two of my stuffed
> animals working on Spark on my flight back from a Python Data conference &
> Spark meetup just to remind everyone that this is just a software project
> and we can be friendly, nice people if we try, and things will be much more
> awesome if we do :)
> [image: image.png]
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>
