spark-user mailing list archives

From Darren Govoni <dar...@ontrenet.com>
Subject Re: Does pyspark still lag far behind the Scala API in terms of features
Date Wed, 02 Mar 2016 22:22:39 GMT

    
DataFrames are essentially structured tables with schemas. So where does the untyped data sit before it becomes structured, if not in a traditional RDD? For us, almost all of the processing happens before there is any structure to the data.
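To make the question concrete, here is roughly the shape of our pipelines (a minimal sketch; the input path, delimiter, and field names are made-up placeholders):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext, Row

    sc = SparkContext(appName="untyped-to-structured")
    sqlContext = SQLContext(sc)

    # Untyped stage: raw lines live in a plain RDD of strings,
    # and this is where almost all of our processing happens.
    lines = sc.textFile("events.log")
    parsed = lines.map(lambda line: line.split(","))

    # Structured stage: a schema is applied only at the very end.
    rows = parsed.map(lambda p: Row(user=p[0], count=int(p[1])))
    df = sqlContext.createDataFrame(rows)
    df.printSchema()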




Sent from my Verizon Wireless 4G LTE smartphone

-------- Original message --------
From: Nicholas Chammas <nicholas.chammas@gmail.com> 
Date: 03/02/2016  5:13 PM  (GMT-05:00) 
To: Jules Damji <dmatrix@comcast.net>, Joshua Sorrell <jsorr80@gmail.com> 
Cc: user@spark.apache.org 
Subject: Re: Does pyspark still lag far behind the Scala API in terms of features 

> However, I believe that investing in Scala (or having some members of your
> group learn it) is worthwhile for a few reasons. One, you will get the
> performance gain, especially now with Tungsten (I'm not sure how it relates
> to Python; knowledgeable people on the list, please chime in).
The more your workload uses DataFrames, the less of a difference there will be between the
languages (Scala, Java, Python, or R) in terms of performance.
One of the main benefits of Catalyst (which DFs enable) is that it automatically optimizes
DataFrame operations, letting you focus on _what_ you want while Spark will take care of figuring
out _how_.
Tungsten takes things further by tightly managing memory using the type information made available
to it via DataFrames. This benefit comes into play regardless of the language used.
So in short, DataFrames are the "new RDD", i.e. the new base structure you should be using in your Spark programs wherever possible. And with DataFrames, the language you use matters much less in terms of performance.
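As a rough sketch of what I mean (run in the PySpark shell, where sqlContext is predefined; the input file and column names are invented):

    # Declare *what* you want...
    df = sqlContext.read.json("people.json")
    result = df.filter(df.age > 21).select(df.name)

    # ...and let Catalyst figure out *how*. The physical plan it prints
    # is the same whether the query was written in Python or Scala.
    result.explain()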
Nick
On Tue, Mar 1, 2016 at 12:07 PM Jules Damji <dmatrix@comcast.net> wrote:
Hello Joshua,
comments are inline...

On Mar 1, 2016, at 5:03 AM, Joshua Sorrell <jsorr80@gmail.com> wrote:
I haven't used Spark in the last year and a half. I am about to start a project with a new
team, and we need to decide whether to use pyspark or Scala.
Indeed, good questions, and they come up a lot in the trainings I have attended, where this inevitable question is raised. I believe it depends on your comfort level and your appetite for venturing into newer things.
True, for the most part the Apache Spark committers have been committed to keeping the APIs at parity across all the language offerings, even though in some cases, in particular Python, they have lagged by a minor release. The extent to which they're committed to parity is a good sign. Some experimental APIs may lag behind, but for the most part they have been admirably consistent.
With Python there's a minor performance hit, since there's an extra level of indirection in the architecture: the executors launch additional Python processes to execute your pickled Python lambdas. Other than that, it boils down to your comfort zone. I recommend looking at Sameer's slides (from the Advanced Spark for DevOps Training), where he walks through the PySpark and Python architecture.
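A quick sketch of the distinction (made-up data; run in the PySpark shell):

    from pyspark.sql import functions as F

    # RDD API: the lambda below is pickled and shipped to the Python
    # worker processes that each executor launches (the extra indirection).
    rdd = sc.parallelize(range(100))
    print(rdd.map(lambda x: x * 2).sum())

    # DataFrame API: the same logic becomes a JVM-side expression,
    # so records never round-trip through a Python worker.
    df = sqlContext.range(100)   # a DataFrame with a single "id" column
    df.select(F.sum(df.id * 2)).show()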

We are NOT a Java shop, so some of the build tools/procedures will require some learning overhead if we go the Scala route. What I want to know is: is the Scala version of Spark still far enough ahead of pyspark to be well worth any initial training overhead?
If you are a very advanced Python shop, with in-house libraries or ML libs written in Python that don't exist in Scala and would require a fair amount of porting, and the gap is too large, then perhaps it makes sense to stay put with Python.
However, I believe that investing in Scala (or having some members of your group learn it) is worthwhile for a few reasons. One, you will get the performance gain, especially now with Tungsten (I'm not sure how it relates to Python; knowledgeable people on the list, please chime in). Two, since Spark is written in Scala, being able to read the sources (which are well documented and highly readable) gives you an enormous advantage should you have to consult or learn the nuances of a certain API method or action not covered comprehensively in the docs. And finally, there's a long-term benefit in learning Scala for reasons other than Spark, for example, writing other scalable and distributed applications.

In particular, we will be using Spark Streaming. I know that a couple of years ago this practically forced the decision to use Scala. Is this still the case?
You'll notice that certain API calls are not available, at least for now, in Python: http://spark.apache.org/docs/latest/streaming-programming-guide.html
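That said, the core DStream API is there in Python; a minimal sketch of a streaming word count (the host and port are placeholders):

    from pyspark.streaming import StreamingContext

    # One-second micro-batches on an existing SparkContext `sc`.
    ssc = StreamingContext(sc, 1)
    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.flatMap(lambda line: line.split(" "))
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()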

Cheers,
Jules
--
The Best Ideas Are Simple
Jules S. Damji
e-mail: dmatrix@comcast.net
e-mail: jules.damji@gmail.com

