spark-user mailing list archives

From Wim Van Leuven <wim.vanleu...@highestpoint.biz>
Subject Re: Scala vs Python for ETL with Spark
Date Sat, 10 Oct 2020 07:22:56 GMT
Hey Mich,

This is a very fair question. I've seen many data engineering teams start
out with Scala because technically it is the best choice, for many of the
reasons given, and because basically it is what Spark itself is written in.

On the other hand, almost all use cases we see these days are data science
use cases, where people mostly work in Python. So, if you need those two
worlds to collaborate and even hand over code, you don't want the ideological
battle of Scala vs Python. We chose Python for the sake of everybody speaking
the same language.

That holds when you use Spark DataFrames, because then PySpark is a thin
layer around everything running on the JVM. Even the discussion about Python
UDFs doesn't hold up. If it works as a Python function (and most of the time
it does), why do it in Scala? If, however, the performance characteristics
show you otherwise, implement those UDFs on the JVM.
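
To make the UDF point concrete, here is a minimal sketch, not a benchmark.
The function clean_name, the column names, and the two helper functions are
hypothetical and made up for illustration. The Spark imports sit inside the
helpers only so that the plain Python function can be tested without Spark
installed.

```python
def clean_name(raw):
    """Plain Python logic: trim whitespace and upper-case; None stays None."""
    if raw is None:
        return None
    return raw.strip().upper()


def add_clean_name_udf(df, column="name"):
    """Python UDF variant: each value is serialized to a Python worker and back."""
    # Imported lazily so clean_name can be unit-tested without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    clean_name_udf = F.udf(clean_name, StringType())
    return df.withColumn("clean_name", clean_name_udf(F.col(column)))


def add_clean_name_builtin(df, column="name"):
    """Built-in variant: the expression is planned by Catalyst and runs on the JVM."""
    from pyspark.sql import functions as F

    return df.withColumn("clean_name", F.upper(F.trim(F.col(column))))
```

Both helpers add a clean_name column to a DataFrame. For ordinary
space-padded ASCII input they should agree, though edge cases such as tabs
or locale-specific casing may differ. Start with whichever is easier to
write and test, and move to built-ins or a JVM UDF only if measurements say
you need to.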

The problem with Python? Good engineering practices translated into tools are
much rarer ... a build tool like Maven for Java or SBT for Scala doesn't
exist ... yet? You can look at PyBuilder for this.

So, referring to the website you mention ... in practice, because of the
many data science use cases out there, I see many Spark shops prefer Python
over Scala, because Spark gravitates to DataFrames, where the downsides of
Python carry little weight. The performance of Python as a driver program,
which is just the glue code, becomes irrelevant compared to the processing
you are doing on the JVM. We even notice that Python is much easier, and we
hear echoes that finding (good?) Scala engineers is hard(er).

So, conclusion: Python brings data engineers and data scientists together.
If you only do data engineering, Scala can be the better choice. It depends
on the context.

Hope this helps
-wim

On Fri, 9 Oct 2020 at 23:27, Mich Talebzadeh <mich.talebzadeh@gmail.com>
wrote:

> Thanks
>
> So, ignoring Python lambdas, is the individual's familiarity with the
> language the most important factor? Also, I have noticed that the Spark
> documentation has switched from Scala to Python as the first example.
> However, some code, for example JDBC calls, is the same for Scala and
> Python.
>
> Some examples like this website
> <https://www.kdnuggets.com/2018/05/apache-spark-python-scala.html#:~:text=Scala%20is%20frequently%20over%2010,languages%20are%20faster%20than%20interpreted.>
> claim that Scala performance is an order of magnitude better than Python
> and also when it comes to concurrency Scala is a better choice. Maybe it is
> pretty old (2018)?
>
> Also (and maybe it is my ignorance, I have not researched it) does Spark
> offer a REPL in the form of spark-shell with Python?
>
>
> Regards,
>
> Mich
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
> On Fri, 9 Oct 2020 at 21:59, Russell Spitzer <russell.spitzer@gmail.com>
> wrote:
>
>> As long as you don't use Python lambdas in your Spark job, there should be
>> almost no difference between the Scala and Python DataFrame code. Once you
>> introduce Python lambdas, you will hit some significant serialization
>> penalties and will also have to run the actual work in Python. As long as
>> no lambdas are used, everything will operate with Catalyst-compiled Java
>> code, so there won't be a big difference between Python and Scala.
>>
>> On Fri, Oct 9, 2020 at 3:57 PM Mich Talebzadeh <mich.talebzadeh@gmail.com>
>> wrote:
>>
>>> I have come across occasions when the teams use Python with Spark for
>>> ETL, for example processing data from S3 buckets into Snowflake with Spark.
>>>
>>> The only reason I think they are choosing Python as opposed to Scala is
>>> because they are more familiar with Python. The fact that Spark is
>>> written in Scala is itself a reason why I think Scala has an edge.
>>>
>>> I have not done a one-to-one comparison of Spark with Scala vs Spark with
>>> Python. I understand that for data science purposes most libraries, like
>>> TensorFlow etc., are written in Python, but I am at a loss to understand
>>> the validity of using Python with Spark for ETL purposes.
>>>
>>> This is my understanding, not established fact, so I would like to get
>>> some informed views on this if I can?
>>>
>>> Many thanks,
>>>
>>> Mich
>>
