From: Reynold Xin
Date: Wed, 4 Apr 2018 16:20:01 -0700
Subject: time for Apache Spark 3.0?
To: dev@spark.apache.org

There was a discussion thread on scala-contributors about Apache Spark not
yet supporting Scala 2.12, and that got me thinking that perhaps it is
about time for Spark to work towards the 3.0 release. By the time it comes
out, it will be more than 2 years since Spark 2.0.

For contributors less familiar with Spark's history, I want to give more
context on Spark releases:

1. Timeline: Spark 1.0 was released in May 2014. Spark 2.0 came out in
July 2016. If we were to maintain the ~2 year cadence, it is time to work
on Spark 3.0 in 2018.

2. Spark's versioning policy promises that Spark does not break stable
APIs in feature releases (e.g. 2.1, 2.2). API-breaking changes are
sometimes a necessary evil, and can be done in major releases (e.g. 1.6 to
2.0, 2.x to 3.0).

3. That said, a major version isn't necessarily a playground for
disruptive API changes that make it painful for users to update. The main
purpose of a major release is the opportunity to fix things that are
broken in the current API and to remove certain deprecated APIs.

4. Spark as a project has a culture of evolving its architecture and
developing major new features incrementally, so major releases are not the
only time for exciting new features. For example, the bulk of the work on
the move towards the DataFrame API was done in Spark 1.3, and Continuous
Processing was introduced in Spark 2.3. Both were feature releases rather
than major releases.

You can find more background in the thread discussing Spark 2.0:
http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-Spark-2-0-td15122.html

The primary motivating factor, IMO, for a major version bump is to support
Scala 2.12, which requires minor breaking changes to Spark's APIs. Similar
to Spark 2.0, I think there are also opportunities for other changes that
we know have been biting us for a long time but can't be made in feature
releases (to be clear, I'm actually not sure they are all good ideas, but
I'm writing them down as candidates for consideration):

1. Support Scala 2.12.

2. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
Spark 2.x.

3. Shade all dependencies.

4. Change the reserved keywords in Spark SQL to be more ANSI-SQL
compliant, to prevent users from shooting themselves in the foot, e.g.
"SELECT 2 SECOND" -- is "SECOND" an interval unit or an alias? To make the
upgrade less painful for users, I'd suggest adding a flag for a backward
compatibility mode. (See the sketch after this list.)

5. Similar to 4, make our type coercion rules in DataFrame/SQL more
standards-compliant, and have a flag for backward compatibility.

6. Miscellaneous other small changes already documented in JIRA (e.g.
"JavaPairRDD flatMapValues requires function returning Iterable, not
Iterator", "Prevent column name duplication in temporary view").
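To make 4 and 5 concrete, here is a minimal sketch of the two possible
readings and of today's coercion behavior (assuming a SparkSession named
`spark`; the config name on the last line is a hypothetical placeholder,
not an existing flag, and the coercion results are from memory, so worth
double-checking):

    // Item 4: is SECOND an alias or an interval unit?
    spark.sql("SELECT 2 SECOND").show()          // reading A: an INT column aliased SECOND
    spark.sql("SELECT INTERVAL 2 SECOND").show() // reading B: the unambiguous ANSI spelling

    // Item 5: current coercion silently casts strings in comparisons and
    // arithmetic, where ANSI SQL would require an explicit cast:
    spark.sql("SELECT 1 = '1'").show()           // true
    spark.sql("SELECT 1 + '2'").show()           // 3.0 (string promoted to double)

    // Both changes could be gated behind a compatibility flag, e.g.:
    // spark.conf.set("spark.sql.legacy.parserBehavior", "true")  // hypothetical name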
Now the reality of a major version bump is that the world often thinks in
terms of what exciting features are coming. I do think there are a number
of major changes happening already that can be part of the 3.0 release, if
they make it in:

1. Scala 2.12 support (listing it twice)
2. Continuous Processing non-experimental
3. Kubernetes support non-experimental
4. A more fleshed out version of the data source API v2 (I don't think it
is realistic to stabilize that in one release)
5. Hadoop 3.0 support
6. ...

Similar to the 2.0 discussion, this thread should focus on the framework
and whether it'd make sense to create Spark 3.0 as the next release,
rather than on the individual feature requests. Those are important but
are best done in their own separate threads.