spark-dev mailing list archives

From: Juliet Hougland <juliet.hougl...@gmail.com>
Subject: Re: [discuss] dropping Python 2.6 support
Date: Wed, 06 Jan 2016 01:18:48 GMT
Most admins I talk to about python and spark are already actively managing (or on
their way to managing) their cluster python installations. Even if people begin
using the system python with pyspark, there is eventually a user who needs a
complex dependency (like pandas or sklearn) on the cluster. No admin would muck
around installing libs into the system python, so you end up with other python
installations.

Installing a non-system python is something users intending to run pyspark on a
real cluster should be thinking about eventually anyway. That would work in
situations where people are running pyspark locally or are actively managing
python installations on a cluster. There is an awkward middle point where someone
has installed spark but has not configured their cluster in any other way (i.e.,
by installing a non-default python). Most clusters I see are RHEL/CentOS and
already have something other than the system python used by spark.
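
For example (paths here are purely illustrative), pointing pyspark at a
non-default interpreter is usually just a couple of lines in conf/spark-env.sh,
using the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON variables spark already
exposes:

    # conf/spark-env.sh -- adjust the paths for your cluster
    export PYSPARK_PYTHON=/opt/python-2.7/bin/python          # interpreter used by executors
    export PYSPARK_DRIVER_PYTHON=/opt/python-2.7/bin/python   # interpreter used by the driver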

Which libraries stopped supporting python 2.6, and where does spark use them? The
"ease of transitioning to pyspark on a cluster" problem may be an easier pill to
swallow if it only affected something like mllib or spark sql and not parts of the
core api. You end up hoping numpy or pandas are installed alongside spark's
runtime components anyway, and at that point people really should just go install
a non-system python. There are tradeoffs to using pyspark, and I feel pretty fine
explaining to people that managing their cluster's python installations is
something that comes with using pyspark.

RHEL/CentOS is so common that dropping 2.6 would probably mean a little work for a
lot of people.

--Juliet

On Tue, Jan 5, 2016 at 4:07 PM, Koert Kuipers <koert@tresata.com> wrote:

> hey evil admin :)
> i think the bit about java was from me?
> if so, i meant to indicate that the reality for us is that java is 1.7 on most
> (all?) clusters. i do not believe spark prefers java 1.8. my point was that even
> though java 1.7 is getting old as well, it would be a major issue for me if spark
> dropped java 1.7 support.
>
> On Tue, Jan 5, 2016 at 6:53 PM, Carlile, Ken <carlilek@janelia.hhmi.org>
> wrote:
>
>> As one of the evil administrators that runs a RHEL 6 cluster, we already
>> provide quite a few different versions of python on our cluster pretty darn
>> easily. All you need is a separate install directory and to set the
>> PYTHON_HOME environment variable to point to the correct python, then have
>> the users make sure the correct python is in their PATH. I understand that
>> other administrators may not be so compliant.
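>>
>> (Concretely, that's roughly something like the following, with purely
>> illustrative paths:
>>
>>   export PYTHON_HOME=/usr/local/python-2.7.11
>>   export PATH=$PYTHON_HOME/bin:$PATH
>>
>> before launching pyspark.)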
>>
>> Saw a small bit about the java version in there; does Spark currently
>> prefer Java 1.8.x?
>>
>> —Ken
>>
>> On Jan 5, 2016, at 6:08 PM, Josh Rosen <joshrosen@databricks.com> wrote:
>>
>> Note that you _can_ use a Python 2.7 `ipython` executable on the driver
>>> while continuing to use a vanilla `python` executable on the executors
>>
>>
>> Whoops, just to be clear, this should actually read "while continuing to
>> use a vanilla `python` 2.7 executable".
>>
>> On Tue, Jan 5, 2016 at 3:07 PM, Josh Rosen <joshrosen@databricks.com>
>> wrote:
>>
>>> Yep, the driver and executors need to have compatible Python versions. I
>>> think that there are some bytecode-level incompatibilities between 2.6 and
>>> 2.7 which would impact the deserialization of Python closures, so I think
>>> you need to be running the same 2.x version for all communicating Spark
>>> processes. Note that you _can_ use a Python 2.7 `ipython` executable on the
>>> driver while continuing to use a vanilla `python` executable on the
>>> executors (we have environment variables which allow you to control these
>>> separately).
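>>>
>>> For instance, roughly:
>>>
>>>   export PYSPARK_DRIVER_PYTHON=ipython   # a 2.7-based IPython on the driver
>>>   export PYSPARK_PYTHON=python           # plain python on the executors
>>>
>>> (those are the standard PYSPARK_* environment variables; the values here are
>>> just illustrative).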
>>>
>>> On Tue, Jan 5, 2016 at 3:05 PM, Nicholas Chammas <
>>> nicholas.chammas@gmail.com> wrote:
>>>
>>>> I think all the slaves need the same (or a compatible) version of
>>>> Python installed since they run Python code in PySpark jobs natively.
>>>>
>>>> On Tue, Jan 5, 2016 at 6:02 PM Koert Kuipers <koert@tresata.com> wrote:
>>>>
>>>>> interesting, i didn't know that!
>>>>>
>>>>> On Tue, Jan 5, 2016 at 5:57 PM, Nicholas Chammas <
>>>>> nicholas.chammas@gmail.com> wrote:
>>>>>
>>>>>> even if python 2.7 was needed only on this one machine that launches
>>>>>> the app we can not ship it with our software because its gpl licensed
>>>>>>
>>>>>> Not to nitpick, but maybe this is important. The Python license is
>>>>>> GPL-compatible but not GPL <https://docs.python.org/3/license.html>:
>>>>>>
>>>>>> Note GPL-compatible doesn't mean that we're distributing Python under
>>>>>> the GPL. All Python licenses, unlike the GPL, let you distribute a modified
>>>>>> version without making your changes open source. The GPL-compatible
>>>>>> licenses make it possible to combine Python with other software that is
>>>>>> released under the GPL; the others don't.
>>>>>>
>>>>>> Nick
>>>>>>
>>>>>> On Tue, Jan 5, 2016 at 5:49 PM Koert Kuipers <koert@tresata.com>
>>>>>> wrote:
>>>>>>
>>>>>>> i do not think so.
>>>>>>>
>>>>>>> does the python 2.7 need to be installed on all slaves? if so, we do
>>>>>>> not have direct access to those.
>>>>>>>
>>>>>>> also, spark is easy for us to ship with our software since it's
>>>>>>> apache 2 licensed, and it only needs to be present on the machine that
>>>>>>> launches the app (thanks to yarn).
>>>>>>> even if python 2.7 was needed only on this one machine that launches
>>>>>>> the app, we can not ship it with our software because it's gpl licensed, so
>>>>>>> the client would have to download it and install it themselves, and this
>>>>>>> would mean it's an independent install which has to be audited and approved
>>>>>>> and now you are in for a lot of fun. basically it will never happen.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jan 5, 2016 at 5:35 PM, Josh Rosen <joshrosen@databricks.com
>>>>>>> > wrote:
>>>>>>>
>>>>>>>> If users are able to install Spark 2.0 on their RHEL clusters, then
>>>>>>>> I imagine that they're also capable of installing a standalone Python
>>>>>>>> alongside that Spark version (without changing Python systemwide). For
>>>>>>>> instance, Anaconda/Miniconda make it really easy to install Python
>>>>>>>> 2.7.x/3.x without impacting / changing the system Python, and they don't
>>>>>>>> require any special permissions to install (you don't need root / sudo
>>>>>>>> access). Does this address the Python versioning concerns for RHEL users?
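>>>>>>>>
>>>>>>>> (For reference, the no-root install is roughly: download the Miniconda
>>>>>>>> installer from the Miniconda site, then
>>>>>>>>
>>>>>>>>   bash Miniconda-latest-Linux-x86_64.sh -b -p $HOME/miniconda
>>>>>>>>   export PYSPARK_PYTHON=$HOME/miniconda/bin/python
>>>>>>>>
>>>>>>>> -- installer filename and paths here are approximate.)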
>>>>>>>>
>>>>>>>> On Tue, Jan 5, 2016 at 2:33 PM, Koert Kuipers <koert@tresata.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> yeah, the practical concern is that we have no control over java
>>>>>>>>> or python version on large company clusters. our current reality for the
>>>>>>>>> vast majority of them is java 7 and python 2.6, no matter how outdated
>>>>>>>>> that is.
>>>>>>>>>
>>>>>>>>> i don't like it either, but i cannot change it.
>>>>>>>>>
>>>>>>>>> we currently don't use pyspark so i have no stake in this, but if
>>>>>>>>> we did i can assure you we would not upgrade to spark 2.x if python 2.6
>>>>>>>>> was dropped. no point in developing something that doesn't run for the
>>>>>>>>> majority of customers.
>>>>>>>>>
>>>>>>>>> On Tue, Jan 5, 2016 at 5:19 PM, Nicholas Chammas <
>>>>>>>>> nicholas.chammas@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> As I pointed out in my earlier email, RHEL will support Python
>>>>>>>>>> 2.6 until 2020. So I'm assuming these large companies will have the
>>>>>>>>>> option of riding out Python 2.6 until then.
>>>>>>>>>>
>>>>>>>>>> Are we seriously saying that Spark should likewise support Python
>>>>>>>>>> 2.6 for the next several years? Even though the core Python devs stopped
>>>>>>>>>> supporting it in 2013?
>>>>>>>>>>
>>>>>>>>>> If that's not what we're suggesting, then when, roughly, can we
>>>>>>>>>> drop support? What are the criteria?
>>>>>>>>>>
>>>>>>>>>> I understand the practical concern here. If companies are stuck
>>>>>>>>>> using 2.6, it doesn't matter to them that it is deprecated. But
>>>>>>>>>> balancing that concern against the maintenance burden on this project,
>>>>>>>>>> I would say that "upgrade to Python 2.7 or stay on Spark 1.6.x" is a
>>>>>>>>>> reasonable position to take. There are many tiny annoyances one has to
>>>>>>>>>> put up with to support 2.6.
>>>>>>>>>>
>>>>>>>>>> I suppose if our main PySpark contributors are fine putting up
>>>>>>>>>> with those annoyances, then maybe we don't need to drop support just
>>>>>>>>>> yet...
>>>>>>>>>>
>>>>>>>>>> Nick
>>>>>>>>>> On Tue, Jan 5, 2016 at 2:27 PM, Julio Antonio Soto de Vicente <
>>>>>>>>>> julio@esbet.es> wrote:
>>>>>>>>>>
>>>>>>>>>>> Unfortunately, Koert is right.
>>>>>>>>>>>
>>>>>>>>>>> I've been in a couple of projects using Spark (banking industry)
>>>>>>>>>>> where CentOS + Python 2.6 is the toolbox available.
>>>>>>>>>>>
>>>>>>>>>>> That said, I believe it should not be a concern for Spark.
>>>>>>>>>>> Python 2.6 is old and busted, which is totally opposite to the Spark
>>>>>>>>>>> philosophy IMO.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Jan 5, 2016, at 8:07 PM, Koert Kuipers <koert@tresata.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> rhel/centos 6 ships with python 2.6, doesn't it?
>>>>>>>>>>>
>>>>>>>>>>> if so, i still know plenty of large companies where python 2.6
>>>>>>>>>>> is the only option. asking them for python 2.7 is not going to work.
>>>>>>>>>>>
>>>>>>>>>>> so i think it's a bad idea.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jan 5, 2016 at 1:52 PM, Juliet Hougland <
>>>>>>>>>>> juliet.hougland@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I don't see a reason Spark 2.0 would need to support Python
>>>>>>>>>>>> 2.6. At this point, Python 3 should be the default that is encouraged.
>>>>>>>>>>>> Most organizations acknowledge that 2.7 is common, but lagging
>>>>>>>>>>>> behind the version they should theoretically use. Dropping python 2.6
>>>>>>>>>>>> support sounds very reasonable to me.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas <
>>>>>>>>>>>> nicholas.chammas@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> +1
>>>>>>>>>>>>>
>>>>>>>>>>>>> Red Hat supports Python 2.6 on RHEL 5 until 2020
>>>>>>>>>>>>> <https://alexgaynor.net/2015/mar/30/red-hat-open-source-community/>,
>>>>>>>>>>>>> but otherwise yes, Python 2.6 is ancient history and the core Python
>>>>>>>>>>>>> developers stopped supporting it in 2013. RHEL 5 is not a good enough
>>>>>>>>>>>>> reason to continue support for Python 2.6 IMO.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We should aim to support Python 2.7 and Python 3.3+ (which I
>>>>>>>>>>>>> believe we currently do).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Nick
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Jan 5, 2016 at 8:01 AM Allen Zhang <
>>>>>>>>>>>>> allenzhang010@126.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> plus 1,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> we are currently using python 2.7.2 in our production environment.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> At 2016-01-05 18:11:45, "Meethu Mathew" <
>>>>>>>>>>>>>> meethu.mathew@flytxt.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>> We use Python 2.7
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Meethu Mathew
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin <
>>>>>>>>>>>>>> rxin@databricks.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Does anybody here care about us dropping support for Python
>>>>>>>>>>>>>>> 2.6 in Spark 2.0?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Python 2.6 is ancient, and is pretty slow in many aspects
>>>>>>>>>>>>>>> (e.g. json parsing) when compared with Python 2.7. Some libraries
>>>>>>>>>>>>>>> that Spark depends on stopped supporting 2.6. We can still convince
>>>>>>>>>>>>>>> the library maintainers to support 2.6, but it will be extra work.
>>>>>>>>>>>>>>> I'm curious if anybody still uses Python 2.6 to run Spark.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>
>>
>>
>
