spark-dev mailing list archives

From Juliet Hougland <juliet.hougl...@gmail.com>
Subject Re: [discuss] dropping Python 2.6 support
Date Thu, 07 Jan 2016 19:55:38 GMT
@Reynold Xin, @Josh Rosen: What is the current maintenance burden of
supporting Python 2.6? Which libraries no longer support Python 2.6, and
where does Spark use them?


On Tue, Jan 5, 2016 at 5:40 PM, Jeff Zhang <zjffdu@gmail.com> wrote:

> +1
>
> On Wed, Jan 6, 2016 at 9:18 AM, Juliet Hougland <
> juliet.hougland@gmail.com> wrote:
>
>> Most admins I talk to about python and spark are already actively (or on
>> their way to) managing their cluster python installations. Even if people
>> begin using the system python with pyspark, there is eventually a user who
>> needs a complex dependency (like pandas or sklearn) on the cluster. No
>> admin would muck around installing libs into system python, so you end up
>> with other python installations.
>>
>> Installing a non-system python is something users intending to use
>> pyspark on a real cluster should be thinking about, eventually, anyway. It
>> would work in situations where people are running pyspark locally or
>> actively managing python installations on a cluster. There is an awkward
>> middle point where someone has installed spark but not configured their
>> cluster (by installing a non-default python) in any other way. Most clusters
>> I see are RHEL/CentOS and have something other than system python used by
>> spark.
>>
>> What libraries stopped supporting python 2.6 and where does spark use
>> them? The "ease of transitioning to pyspark onto a cluster" problem may be
>> an easier pill to swallow if it only affected something like mllib or spark
>> sql and not parts of the core api. You end up hoping numpy or pandas are
>> installed in the runtime components of spark anyway. At that point people
>> really should just go install a non-system python. There are tradeoffs to
>> using pyspark and I feel pretty fine explaining to people that managing
>> their cluster's python installations is something that comes with using
>> pyspark.
>>
>> RHEL/CentOS is so common that this would probably be a little work for a
>> lot of people.
>>
>> --Juliet
>>
>> On Tue, Jan 5, 2016 at 4:07 PM, Koert Kuipers <koert@tresata.com> wrote:
>>
>>> hey evil admin:)
>>> i think the bit about java was from me?
>>> if so, i meant to indicate that the reality for us is java is 1.7 on
>>> most (all?) clusters. i do not believe spark prefers java 1.8. my point was
>>> that even though java 1.7 is getting old as well, it would be a major
>>> issue for me if spark dropped java 1.7 support.
>>>
>>> On Tue, Jan 5, 2016 at 6:53 PM, Carlile, Ken <carlilek@janelia.hhmi.org>
>>> wrote:
>>>
>>>> As one of the evil administrators that runs a RHEL 6 cluster, we
>>>> already provide quite a few different versions of python on our cluster
>>>> pretty darn easily. All you need is a separate install directory and to set
>>>> the PYTHON_HOME environment variable to point to the correct python, then
>>>> have the users make sure the correct python is in their PATH. I understand
>>>> that other administrators may not be so compliant.
>>>>
>>>> Saw a small bit about the java version in there; does Spark currently
>>>> prefer Java 1.8.x?
>>>>
>>>> —Ken
>>>>
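For concreteness, a minimal sketch of the setup Ken describes, from the
job's point of view. The install prefix is a hypothetical placeholder, and
PYTHON_HOME here is a site convention rather than a variable Spark itself
reads (Spark's own interpreter selection goes through PYSPARK_PYTHON; see
Josh's note below):

    import os

    # Hypothetical install prefix for one of the extra Pythons the
    # admins provide; the system Python is left untouched.
    python_home = "/usr/local/python-2.7.11"

    # Point PYTHON_HOME at the chosen install and put its bin/ first on
    # PATH, so "python" resolves to the 2.7 interpreter for this session.
    os.environ["PYTHON_HOME"] = python_home
    os.environ["PATH"] = (os.path.join(python_home, "bin")
                          + os.pathsep + os.environ["PATH"])
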
>>>> On Jan 5, 2016, at 6:08 PM, Josh Rosen <joshrosen@databricks.com>
>>>> wrote:
>>>>
>>>> Note that you _can_ use a Python 2.7 `ipython` executable on the driver
>>>>> while continuing to use a vanilla `python` executable on the executors
>>>>
>>>>
>>>> Whoops, just to be clear, this should actually read "while continuing
>>>> to use a vanilla `python` 2.7 executable".
>>>>
>>>> On Tue, Jan 5, 2016 at 3:07 PM, Josh Rosen <joshrosen@databricks.com>
>>>> wrote:
>>>>
>>>>> Yep, the driver and executors need to have compatible Python versions.
>>>>> I think that there are some bytecode-level incompatibilities between 2.6
>>>>> and 2.7 which would impact the deserialization of Python closures, so I
>>>>> think you need to be running the same 2.x version for all communicating
>>>>> Spark processes. Note that you _can_ use a Python 2.7 `ipython` executable
>>>>> on the driver while continuing to use a vanilla `python` executable on the
>>>>> executors (we have environment variables which allow you to control these
>>>>> separately).
>>>>>
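A minimal sketch of the mechanism Josh describes. PYSPARK_PYTHON and
PYSPARK_DRIVER_PYTHON are real Spark environment variables; the interpreter
path below is a placeholder:

    import os
    import sys

    # PYSPARK_PYTHON selects the executor-side interpreter and is read
    # when the SparkContext is created, so it must be set beforehand.
    # The driver-side interpreter is chosen by whatever launches the
    # driver, e.g. `PYSPARK_DRIVER_PYTHON=ipython bin/pyspark`.
    os.environ["PYSPARK_PYTHON"] = "/opt/python-2.7/bin/python"  # placeholder

    from pyspark import SparkContext

    sc = SparkContext(appName="python-version-check")
    # Ask an executor which interpreter it actually runs, and compare.
    executor_version = sc.parallelize([0]).map(lambda _: sys.version).first()
    print("driver:   %s" % sys.version)
    print("executor: %s" % executor_version)
    sc.stop()
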
>>>>> On Tue, Jan 5, 2016 at 3:05 PM, Nicholas Chammas <
>>>>> nicholas.chammas@gmail.com> wrote:
>>>>>
>>>>>> I think all the slaves need the same (or a compatible) version of
>>>>>> Python installed since they run Python code in PySpark jobs natively.
>>>>>>
>>>>>> On Tue, Jan 5, 2016 at 6:02 PM Koert Kuipers <koert@tresata.com>
>>>>>> wrote:
>>>>>>
>>>>>>> interesting i didnt know that!
>>>>>>>
>>>>>>> On Tue, Jan 5, 2016 at 5:57 PM, Nicholas Chammas <
>>>>>>> nicholas.chammas@gmail.com> wrote:
>>>>>>>
>>>>>>>> even if python 2.7 was needed only on this one machine that
>>>>>>>> launches the app we can not ship it with our software because its
>>>>>>>> gpl licensed
>>>>>>>>
>>>>>>>> Not to nitpick, but maybe this is important. The Python license is
>>>>>>>> GPL-compatible but not GPL <https://docs.python.org/3/license.html>:
>>>>>>>>
>>>>>>>> Note GPL-compatible doesn’t mean that we’re distributing Python
>>>>>>>> under the GPL. All Python licenses, unlike the GPL, let you
>>>>>>>> distribute a modified version without making your changes open
>>>>>>>> source. The GPL-compatible licenses make it possible to combine
>>>>>>>> Python with other software that is released under the GPL; the
>>>>>>>> others don’t.
>>>>>>>>
>>>>>>>> Nick
>>>>>>>>
>>>>>>>> On Tue, Jan 5, 2016 at 5:49 PM Koert Kuipers <koert@tresata.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> i do not think so.
>>>>>>>>>
>>>>>>>>> does the python 2.7 need to be installed on all slaves? if so, we
>>>>>>>>> do not have direct access to those.
>>>>>>>>>
>>>>>>>>> also, spark is easy for us to ship with our software since its
>>>>>>>>> apache 2 licensed, and it only needs to be present on the machine
>>>>>>>>> that launches the app (thanks to yarn).
>>>>>>>>> even if python 2.7 was needed only on this one machine that
>>>>>>>>> launches the app we can not ship it with our software because its
>>>>>>>>> gpl licensed, so the client would have to download it and install
>>>>>>>>> it themselves, and this would mean its an independent install which
>>>>>>>>> has to be audited and approved and now you are in for a lot of fun.
>>>>>>>>> basically it will never happen.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Jan 5, 2016 at 5:35 PM, Josh Rosen <
>>>>>>>>> joshrosen@databricks.com> wrote:
>>>>>>>>>
>>>>>>>>>> If users are able to install Spark 2.0 on their RHEL clusters,
>>>>>>>>>> then I imagine that they're also capable of installing a
>>>>>>>>>> standalone Python alongside that Spark version (without changing
>>>>>>>>>> Python systemwide). For instance, Anaconda/Miniconda make it
>>>>>>>>>> really easy to install Python 2.7.x/3.x without impacting or
>>>>>>>>>> changing the system Python, and they don't require any special
>>>>>>>>>> permissions to install (you don't need root/sudo access). Does
>>>>>>>>>> this address the Python versioning concerns for RHEL users?
>>>>>>>>>>
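A sketch of how a job might use such an install, assuming a Miniconda env
created without root (e.g. via `conda create -n py27 python=2.7`); the
prefix below is hypothetical, while PYSPARK_PYTHON is the real Spark
variable:

    import os

    # Assumes a no-root Miniconda install along the lines Josh describes;
    # the prefix is a hypothetical example and the system Python is untouched.
    os.environ["PYSPARK_PYTHON"] = "/home/alice/miniconda2/envs/py27/bin/python"

    from pyspark import SparkContext

    sc = SparkContext(appName="conda-python")
    print(sc.parallelize(range(100)).sum())  # work runs on the conda interpreter
    sc.stop()
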
>>>>>>>>>> On Tue, Jan 5, 2016 at 2:33 PM, Koert Kuipers <koert@tresata.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> yeah, the practical concern is that we have no control over
>>>>>>>>>>> java or python version on large company clusters. our current
>>>>>>>>>>> reality for the vast majority of them is java 7 and python 2.6,
>>>>>>>>>>> no matter how outdated that is.
>>>>>>>>>>>
>>>>>>>>>>> i dont like it either, but i cannot change it.
>>>>>>>>>>>
>>>>>>>>>>> we currently don't use pyspark so i have no stake in this, but
>>>>>>>>>>> if we did i can assure you we would not upgrade to spark 2.x if
>>>>>>>>>>> python 2.6 was dropped. no point in developing something that
>>>>>>>>>>> doesnt run for the majority of customers.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jan 5, 2016 at 5:19 PM, Nicholas Chammas <
>>>>>>>>>>> nicholas.chammas@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> As I pointed out in my earlier email, RHEL will support Python
>>>>>>>>>>>> 2.6 until 2020. So I'm assuming these large companies will have
>>>>>>>>>>>> the option of riding out Python 2.6 until then.
>>>>>>>>>>>>
>>>>>>>>>>>> Are we seriously saying that Spark should likewise support
>>>>>>>>>>>> Python 2.6 for the next several years? Even though the core
>>>>>>>>>>>> Python devs stopped supporting it in 2013?
>>>>>>>>>>>>
>>>>>>>>>>>> If that's not what we're suggesting, then when, roughly, can we
>>>>>>>>>>>> drop support? What are the criteria?
>>>>>>>>>>>>
>>>>>>>>>>>> I understand the practical concern here. If companies are stuck
>>>>>>>>>>>> using 2.6, it doesn't matter to them that it is deprecated. But
>>>>>>>>>>>> balancing that concern against the maintenance burden on this
>>>>>>>>>>>> project, I would say that "upgrade to Python 2.7 or stay on
>>>>>>>>>>>> Spark 1.6.x" is a reasonable position to take. There are many
>>>>>>>>>>>> tiny annoyances one has to put up with to support 2.6.
>>>>>>>>>>>>
>>>>>>>>>>>> I suppose if our main PySpark contributors are fine putting up
>>>>>>>>>>>> with those annoyances, then maybe we don't need to drop support
>>>>>>>>>>>> just yet...
>>>>>>>>>>>>
>>>>>>>>>>>> Nick
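Two concrete examples of the kind of tiny annoyance meant here
(illustrative examples, not ones named in the thread); each works on 2.7
but breaks on 2.6:

    # Auto-numbered format fields were added in 2.7; on 2.6 this line
    # raises "ValueError: zero length field name in format":
    print("{} needs {}".format("pyspark", "python 2.7"))  # 2.6 wants "{0} needs {1}"

    # Dict comprehensions are 2.7+ syntax; 2.6 code has to spell the same
    # thing as dict() over a generator instead:
    squares = {n: n * n for n in range(5)}            # SyntaxError on 2.6
    squares_26 = dict((n, n * n) for n in range(5))   # 2.6-compatible form
    print(squares == squares_26)                      # True on 2.7
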
>>>>>>>>>>>> On Tue, Jan 5, 2016 at 2:27 PM, Julio Antonio Soto de Vicente <
>>>>>>>>>>>> julio@esbet.es> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Unfortunately, Koert is right.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I've been in a couple of projects using Spark (banking
>>>>>>>>>>>>> industry) where CentOS + Python 2.6 is the toolbox available.
>>>>>>>>>>>>>
>>>>>>>>>>>>> That said, I believe it should not be a concern for Spark.
>>>>>>>>>>>>> Python 2.6 is old and busted, which is totally opposite to the
>>>>>>>>>>>>> Spark philosophy IMO.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Jan 5, 2016, at 8:07 PM, Koert Kuipers <koert@tresata.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> rhel/centos 6 ships with python 2.6, doesnt it?
>>>>>>>>>>>>>
>>>>>>>>>>>>> if so, i still know plenty of large companies where python 2.6
>>>>>>>>>>>>> is the only option. asking them for python 2.7 is not going to
>>>>>>>>>>>>> work
>>>>>>>>>>>>>
>>>>>>>>>>>>> so i think its a bad idea
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Jan 5, 2016 at 1:52 PM, Juliet Hougland <
>>>>>>>>>>>>> juliet.hougland@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I don't see a reason Spark 2.0 would need to support Python
>>>>>>>>>>>>>> 2.6. At this point, Python 3 should be the default that is
>>>>>>>>>>>>>> encouraged. Most organizations acknowledge that 2.7 is common,
>>>>>>>>>>>>>> but lagging behind the version they should theoretically use.
>>>>>>>>>>>>>> Dropping python 2.6 support sounds very reasonable to me.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas <
>>>>>>>>>>>>>> nicholas.chammas@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Red Hat supports Python 2.6 on RHEL 6 until 2020
>>>>>>>>>>>>>>> <https://alexgaynor.net/2015/mar/30/red-hat-open-source-community/>,
>>>>>>>>>>>>>>> but otherwise yes, Python 2.6 is ancient history and the core
>>>>>>>>>>>>>>> Python developers stopped supporting it in 2013. RHEL 6 is
>>>>>>>>>>>>>>> not a good enough reason to continue support for Python 2.6
>>>>>>>>>>>>>>> IMO.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We should aim to support Python 2.7 and Python 3.3+ (which I
>>>>>>>>>>>>>>> believe we currently do).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Nick
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Jan 5, 2016 at 8:01 AM, Allen Zhang <
>>>>>>>>>>>>>>> allenzhang010@126.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> plus 1,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> we are currently using python 2.7.2 in our production
>>>>>>>>>>>>>>>> environment.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 2016-01-05 18:11:45, "Meethu Mathew" <
>>>>>>>>>>>>>>>> meethu.mathew@flytxt.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>>> We use Python 2.7
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Meethu Mathew
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin <
>>>>>>>>>>>>>>>> rxin@databricks.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Does anybody here care about us dropping support for
>>>>>>>>>>>>>>>>> Python 2.6 in Spark 2.0?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Python 2.6 is ancient, and is pretty slow in many aspects
>>>>>>>>>>>>>>>>> (e.g. json parsing) when compared with Python 2.7. Some
>>>>>>>>>>>>>>>>> libraries that Spark depends on stopped supporting 2.6. We
>>>>>>>>>>>>>>>>> can still convince the library maintainers to support 2.6,
>>>>>>>>>>>>>>>>> but it will be extra work. I'm curious if anybody still
>>>>>>>>>>>>>>>>> uses Python 2.6 to run Spark.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>
>
> --
> Best Regards
>
> Jeff Zhang
>
