spark-user mailing list archives

From Gourav Sengupta <gourav.sengu...@gmail.com>
Subject Re: Scala Vs Python
Date Mon, 05 Sep 2016 22:20:25 GMT
The pertinent question is the choice between "functional programming" and
procedural or object-oriented programming.

I think when you are dealing with data solutions, functional programming is
a more natural way to think and work.


Regards,
Gourav

On Sun, Sep 4, 2016 at 11:17 AM, AssafMendelson <assaf.mendelson@rsa.com>
wrote:

> I don't have anything at hand (unfortunately I didn't save it), but you
> can easily make some toy examples.
>
> For example, you might define a simple UDF (e.g. test whether a
> number is < 10).
>
> Then create the function in Scala:
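>
> package com.example
>
> import org.apache.spark.sql.functions.udf
>
> // Holder object whose createUDF is called from pyspark through the py4j gateway
> object udfObj extends Serializable {
>   // Returns a Spark SQL UDF that tests whether its integer argument is below 10
>   def createUDF = {
>     udf((x: Int) => x < 10)
>   }
> }
>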
> Compile the Scala code into a jar and run pyspark with both --jars and
> --driver-class-path pointing at the created jar.
>
> Inside pyspark, do something like:
>
>
> from py4j.java_gateway import java_import
> from pyspark.sql.column import Column, _to_java_column, _to_seq
> from pyspark.sql.functions import udf
> from pyspark.sql.types import BooleanType
> import time
>
> jvm = sc._gateway.jvm
> java_import(jvm, "com.example.*")
>
> def udf_scala(col):
>     # The Scala UDF's apply takes Column varargs, so pass a Scala Seq of Java columns
>     scala_udf = jvm.com.example.udfObj.createUDF()
>     return Column(scala_udf.apply(_to_seq(sc, [col], _to_java_column)))
>
> udf_python = udf(lambda x: x < 10, BooleanType())
>
> # Cache and materialize the input before timing anything
> df = spark.range(10000000)
> df.cache()
> df.count()
>
> df1 = df.filter(df.id < 10)          # built-in expression
> df2 = df.filter(udf_scala(df.id))    # wrapped Scala UDF
> df3 = df.filter(udf_python(df.id))   # Python UDF
>
> t1 = time.time()
> df1.count()
> t2 = time.time()
> df2.count()
> t3 = time.time()
> df3.count()
> t4 = time.time()
>
> print("time for builtin " + str(t2 - t1))
> print("time for scala " + str(t3 - t2))
> print("time for python " + str(t4 - t3))
>
> The differences between the timestamps give the time each count takes (note
> that the data is cached and counted first so that the cost of creating the
> range is not charged to whichever query happens to run first).
>
> BTW, I saw that the results can be very sensitive to the cluster and its
> configuration. I ran it on two different cluster configurations, several
> times each, to get some idea of the noise.
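
A minimal sketch of the "run it several times" idea, in plain Python so it runs without a cluster. The helper names `time_action` and `summarize` are hypothetical and not part of pyspark, and the lambda is only a stand-in workload.

```python
import statistics
import time


def time_action(action, repeats=5):
    """Run a zero-argument callable several times; return per-run wall-clock seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Report median, min and max; the median is less sensitive to one slow run."""
    return {
        "median": statistics.median(durations),
        "min": min(durations),
        "max": max(durations),
    }


# Stand-in workload so the sketch runs anywhere. Inside pyspark one would pass
# a bound method such as df2.count (the method itself, not its result).
runs = time_action(lambda: sum(range(100000)), repeats=5)
print(summarize(runs))
```

Comparing the spread between min and max against the difference between two variants gives a rough sense of whether that difference is larger than the noise.
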
>
> Of course, the more complicated the UDF, the less the overhead affects you.
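
A back-of-the-envelope sketch of that last point, assuming the Python UDF adds a roughly fixed cost per row on top of the UDF's own work. Both the fixed-cost model and the numbers are assumptions for illustration, not measurements, and `slowdown` is a hypothetical helper.

```python
def slowdown(per_row_overhead, per_row_work):
    """Ratio of (overhead + work) to work alone, per row, in arbitrary time units."""
    return (per_row_overhead + per_row_work) / per_row_work


# Hypothetical fixed overhead of 9 units per row, with increasingly heavy UDFs.
for work in (1.0, 10.0, 100.0):
    print("work per row:", work, "relative cost:", slowdown(9.0, work))
```

Under these assumed numbers, a trivial UDF pays about ten times its own cost, while one that does a hundred units of work per row pays only about nine percent extra.
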
>
> Hope this helps.
>
>                 Assaf
>
> *From:* ayan guha [hidden email]
> *Sent:* Sunday, September 04, 2016 11:00 AM
> *To:* Mendelson, Assaf
> *Cc:* user
> *Subject:* Re: Scala Vs Python
>
>
>
> Hi
>
>
>
> This one is quite interesting. Is it possible to share a few toy examples?
>
>
>
> On Sun, Sep 4, 2016 at 5:23 PM, AssafMendelson <[hidden email]> wrote:
>
> I am not aware of any official testing, but you can easily create your own.
>
> In the tests I ran, Python UDFs were more than 10 times slower than Scala
> UDFs (and in some cases closer to 50 times slower).
>
> That said, it depends on how you use your UDF.
>
> For example, let's say you have a 1 billion row table that you aggregate
> down to a 10K row table. If you apply the Python UDF at the beginning, it
> might take a hard hit, but if you apply it to the 10K row table, the
> overhead might be negligible.
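
A toy illustration of that ordering effect, in plain Python rather than Spark: it only counts how often a stand-in UDF is invoked when applied before versus after an aggregation. The data, `expensive_label` and `aggregate` are all hypothetical, and the reordering is only an option when the UDF makes sense on the aggregated rows.

```python
from collections import defaultdict

calls = {"count": 0}


def expensive_label(value):
    """Stand-in for a Python UDF; counts how often it is invoked."""
    calls["count"] += 1
    return "small" if value < 10 else "large"


def aggregate(rows):
    """Sum values per key, mimicking a group-by followed by a sum."""
    totals = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return dict(totals)


rows = [(i % 100, 1) for i in range(100000)]  # 100,000 rows, 100 distinct keys

# UDF first: invoked once per input row.
calls["count"] = 0
labelled_rows = [(key, value, expensive_label(value)) for key, value in rows]
calls_before = calls["count"]

# Aggregate first: invoked once per aggregated row.
calls["count"] = 0
totals = aggregate(rows)
labelled_totals = {key: expensive_label(total) for key, total in totals.items()}
calls_after = calls["count"]

print("UDF calls before aggregation:", calls_before)
print("UDF calls after aggregation:", calls_after)
```

Whatever the per-call overhead is, it is paid once per row the UDF sees, so shrinking the table first shrinks the total overhead by the same factor.
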
>
> Furthermore, you can always write the UDF in Scala and wrap it.
>
> This is something my team did. We have data scientists working on Spark in
> Python. Normally, they can use the existing functions to do what they need
> (Spark already has a pretty nice spread of functions which answer most of
> the common use cases). When they need a new UDF or UDAF, they simply ask my
> team (which does the engineering), and we write them a Scala one and then
> wrap it to be accessible from Python.
>
>
>
>
>
> *From:* ayan guha [hidden email]
> *Sent:* Friday, September 02, 2016 12:21 AM
> *To:* kant kodali
> *Cc:* Mendelson, Assaf; user
> *Subject:* Re: Scala Vs Python
>
>
>
> Thanks All for your replies.
>
>
>
> Feature Parity:
>
>
>
> MLlib, RDD and DataFrame features are totally comparable. Streaming is
> now on par in functionality too, I believe. However, what really worries me
> is not having Dataset APIs at all in Python. I think that's a deal breaker.
>
>
>
> Performance:
>
> I do get this bit when RDDs are involved, but not when the DataFrame is the
> only construct I am operating on. DataFrames are supposed to be
> language-agnostic in terms of performance. So why do people think Python is
> slower? Is it because of using UDFs? Any other reason?
>
>
>
> *Are there any benchmarks/stats comparing Python UDFs with Scala UDFs,
> like the ones out there for RDDs?*
>
>
>
> @Kant: I am not comparing ANY applications. I am comparing SPARK
> applications only. I would be glad to hear your opinion on why pyspark
> applications will not work. If you have any benchmarks, please share them
> if possible.
>
> On Fri, Sep 2, 2016 at 12:57 AM, kant kodali <[hidden email]> wrote:
>
> C'mon man, this is a no-brainer. Dynamically typed languages for large code
> bases or large-scale distributed systems make absolutely no sense. I can
> write a 10-page essay on why that wouldn't work so well. You might be
> wondering why Spark would have it then? Well, probably because of its ease
> of use for ML (that would be my best guess).
>
>
>
>
>
> On Wed, Aug 31, 2016 11:45 PM, AssafMendelson [hidden email] wrote:
>
> I believe this would greatly depend on your use case and your familiarity
> with the languages.
>
>
>
> In general, Scala has much better performance than Python, and not all
> interfaces are available in Python.
>
> That said, if you are planning to use DataFrames without any UDFs, then the
> performance hit is practically nonexistent.
>
> Even if you need UDFs, it is possible to write them in Scala and wrap them
> for Python, and still avoid the performance hit.
>
> Python does not have interfaces for UDAFs.
>
>
>
> I believe that if you have large structured data and do not generally need
> UDF/UDAF you can certainly work in python without losing too much.
>
>
>
>
>
> *From:* ayan guha [hidden email]
> *Sent:* Thursday, September 01, 2016 5:03 AM
> *To:* user
> *Subject:* Scala Vs Python
>
>
>
> Hi Users
>
>
>
> Thought to ask (again and again) the question: While I am building any
> production application, should I use Scala or Python?
>
>
>
> I have read many if not most articles, but they all seem to be pre-Spark 2.
> Has anything changed with Spark 2, either in favor of Scala or of Python?
>
>
>
> I am thinking performance, feature parity and future direction, not so
> much in terms of skillset or ease of use.
>
>
>
> Or, if you think it is a moot point, please say so as well.
>
>
>
> Any real life example, production experience, anecdotes, personal taste,
> profanity all are welcome :)
>
>
>
> --
>
> Best Regards,
> Ayan Guha
>
>
> ------------------------------
>
> View this message in context: RE: Scala Vs Python
> <http://apache-spark-user-list.1001560.n3.nabble.com/RE-Scala-Vs-Python-tp27637.html>
> Sent from the Apache Spark User List mailing list archive
> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>
>
>
>
>
> --
>
> Best Regards,
> Ayan Guha
>
>
>
>
>
>
>
> --
>
> Best Regards,
> Ayan Guha
>
>
