spark-dev mailing list archives

From "Allen Zhang" <allenzhang...@126.com>
Subject Re: A proposal for Spark 2.0
Date Tue, 22 Dec 2015 07:18:22 GMT
plus dev






At 2015-12-22 15:15:59, "Allen Zhang" <allenzhang010@126.com> wrote:

Hi Reynold,


Is any new API support for GPU computing planned for the new 2.0 release?


-Allen





At 2015-12-22 14:12:50, "Reynold Xin" <rxin@databricks.com> wrote:

FYI I updated the master branch's Spark version to 2.0.0-SNAPSHOT. 


On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin <rxin@databricks.com> wrote:

I’m starting a new thread since the other one got intermixed with feature requests. Please
refrain from making feature requests in this thread. Not that we shouldn’t be adding features,
but we can always add features in 1.7, 2.1, 2.2, ...


First - I want to propose a premise for how to think about Spark 2.0 and major releases in
Spark, based on discussion with several members of the community: a major release should be
low overhead and minimally disruptive to the Spark community. A major release should not be
very different from a minor release and should not be gated on new features. The main
purpose of a major release is to provide an opportunity to fix things that are broken in the
current API and to remove certain deprecated APIs (examples follow).


For this reason, I would *not* propose doing major releases to break substantial APIs or
perform large re-architecting that prevents users from upgrading. Spark has always had a culture
of evolving architecture incrementally and making changes, and I don't think we want to change
this model. In fact, we’ve released many architectural changes on the 1.X line.


If the community likes the above model, then to me it seems reasonable to do Spark 2.0 either
after Spark 1.6 (in lieu of Spark 1.7) or immediately after Spark 1.7. It will be 18 or 21
months since Spark 1.0. A cadence of major releases every 2 years seems doable within the
above model.


Under this model, here is a list of example things I would propose doing in Spark 2.0, separated
into APIs and Operation/Deployment:




APIs


1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in Spark 1.x.


2. Remove Akka from Spark’s API dependency (in streaming), so user applications can use
Akka (SPARK-5293). We have gotten a lot of complaints about user applications being unable
to use Akka due to Spark’s dependency on Akka.


3. Remove Guava from Spark’s public API (JavaRDD Optional).


4. Better class package structure for low level developer APIs. In particular, we have
some DeveloperApi classes (mostly various listener-related classes) added over the years. Some
packages include only one or two public classes but a lot of private classes. A better structure
is to have public classes isolated to a few public packages, and these public packages should
have minimal private classes for low level developer APIs.


5. Consolidate the task metric and accumulator APIs. Although they have some subtle differences,
the two are very similar yet have completely different code paths.


6. Possibly make Catalyst, Dataset, and DataFrame more general by moving them to other package(s).
They are already used beyond SQL, e.g. in ML pipelines, and will also be used by streaming.




Operation/Deployment


1. Scala 2.11 as the default build. We should still support Scala 2.10, but it has reached end-of-life.


2. Remove Hadoop 1 support. 


3. Assembly-free distribution of Spark: don’t require building an enormous assembly jar
in order to run Spark.








 