spark-dev mailing list archives

From Reynold Xin <r...@databricks.com>
Subject Re: A proposal for Spark 2.0
Date Thu, 26 Nov 2015 21:01:09 GMT
I don't think there is any plan for Scala 2.12 support yet. We can always
add Scala 2.12 support later.


On Thu, Nov 26, 2015 at 12:59 PM, Koert Kuipers <koert@tresata.com> wrote:

> I also thought the idea was to drop 2.10. Do we want to cross build for 3
> scala versions?
> On Nov 25, 2015 3:54 AM, "Sandy Ryza" <sandy.ryza@cloudera.com> wrote:
>
>> I see.  My concern is / was that cluster operators will be reluctant to
>> upgrade to 2.0, meaning that developers using those clusters need to stay
>> on 1.x, and, if they want to move to DataFrames, essentially need to port
>> their app twice.
>>
>> I misunderstood and thought part of the proposal was to drop support for
>> 2.10 though.  If your broad point is that there aren't changes in 2.0 that
>> will make it less palatable to cluster administrators than releases in the
>> 1.x line, then yes, 2.0 as the next release sounds fine to me.
>>
>> -Sandy
>>
>>
>> On Tue, Nov 24, 2015 at 11:55 AM, Matei Zaharia <matei.zaharia@gmail.com>
>> wrote:
>>
>>> What are the other breaking changes in 2.0 though? Note that we're not
>>> removing Scala 2.10, we're just making the default build be against Scala
>>> 2.11 instead of 2.10. There seem to be very few changes that people would
>>> worry about. If people are going to update their apps, I think it's better
>>> to make the other small changes in 2.0 at the same time than to update once
>>> for Dataset and another time for 2.0.
>>>
>>> BTW just refer to Reynold's original post for the other proposed API
>>> changes.
>>>
>>> Matei
>>>
>>> On Nov 24, 2015, at 12:27 PM, Sandy Ryza <sandy.ryza@cloudera.com>
>>> wrote:
>>>
>>> I think that Kostas' logic still holds.  The majority of Spark users,
>>> and likely an even vaster majority of people running vaster jobs, are still
>>> on RDDs and on the cusp of upgrading to DataFrames.  Users will probably
>>> want to upgrade to the stable version of the Dataset / DataFrame API so
>>> they don't need to do so twice.  Requiring that they absorb all the other
>>> ways that Spark breaks compatibility in the move to 2.0 makes it much more
>>> difficult for them to make this transition.
>>>
>>> Using the same set of APIs also means that it will be easier to backport
>>> critical fixes to the 1.x line.
>>>
>>> It's not clear to me that avoiding breakage of an experimental API in
>>> the 1.x line outweighs these issues.
>>>
>>> -Sandy
>>>
>>> On Mon, Nov 23, 2015 at 10:51 PM, Reynold Xin <rxin@databricks.com>
>>> wrote:
>>>
>>>> I actually think the next one (after 1.6) should be Spark 2.0. The
>>>> reason is that I already know we have to break some part of the
>>>> DataFrame/Dataset API as part of the Dataset design. (e.g. DataFrame.map
>>>> should return Dataset rather than RDD). In that case, I'd rather break this
>>>> sooner (in one release) than later (in two releases), so the damage is
>>>> smaller.
>>>>
>>>> I don't think whether we call Dataset/DataFrame experimental or not
>>>> matters too much for 2.0. We can still call Dataset experimental in 2.0 and
>>>> then mark them as stable in 2.1. Despite being "experimental", there have
>>>> been no breaking changes to DataFrame from 1.3 to 1.6.
>>>>
>>>>
>>>>
>>>> On Wed, Nov 18, 2015 at 3:43 PM, Mark Hamstra <mark@clearstorydata.com>
>>>> wrote:
>>>>
>>>>> Ah, got it; by "stabilize" you meant changing the API, not just bug
>>>>> fixing.  We're on the same page now.
>>>>>
>>>>> On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis <kostas@cloudera.com>
>>>>> wrote:
>>>>>
>>>>>> A 1.6.x release will only fix bugs - we typically don't change APIs
>>>>>> in z releases. The Dataset API is experimental and so we might be
>>>>>> changing the APIs before we declare it stable. This is why I think it
>>>>>> is important to first stabilize the Dataset API with a Spark 1.7
>>>>>> release before moving to Spark 2.0. This will benefit users that would
>>>>>> like to use the new Dataset APIs but can't move to Spark 2.0 because
>>>>>> of the backwards incompatible changes, like removal of deprecated
>>>>>> APIs, Scala 2.11 etc.
>>>>>>
>>>>>> Kostas
>>>>>>
>>>>>>
>>>>>> On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra <
>>>>>> mark@clearstorydata.com> wrote:
>>>>>>
>>>>>>> Why does stabilization of those two features require a 1.7 release
>>>>>>> instead of 1.6.1?
>>>>>>>
>>>>>>> On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis <
>>>>>>> kostas@cloudera.com> wrote:
>>>>>>>
>>>>>>>> We have veered off the topic of Spark 2.0 a little bit here - yes,
>>>>>>>> we can talk about RDD vs. DS/DF more, but let's refocus on Spark 2.0.
>>>>>>>> I'd like to propose we have one more 1.x release after Spark 1.6.
>>>>>>>> This will allow us to stabilize a few of the new features that were
>>>>>>>> added in 1.6:
>>>>>>>>
>>>>>>>> 1) the experimental Datasets API
>>>>>>>> 2) the new unified memory manager.
>>>>>>>>
>>>>>>>> I understand our goal for Spark 2.0 is to offer an easy transition
>>>>>>>> but there will be users that won't be able to seamlessly upgrade
>>>>>>>> given what we have discussed as in scope for 2.0. For these users,
>>>>>>>> having a 1.x release with these new features/APIs stabilized will be
>>>>>>>> very beneficial. This might make Spark 1.7 a lighter release but that
>>>>>>>> is not necessarily a bad thing.
>>>>>>>>
>>>>>>>> Any thoughts on this timeline?
>>>>>>>>
>>>>>>>> Kostas Sakellis
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao <hao.cheng@intel.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Agree, more features/APIs/optimizations need to be added in DF/DS.
>>>>>>>>>
>>>>>>>>> I mean, we need to think about what kind of RDD APIs we have to
>>>>>>>>> provide to developers; maybe the fundamental API is enough, like
>>>>>>>>> ShuffledRDD etc. But PairRDDFunctions is probably not in this
>>>>>>>>> category, as we can do the same thing easily with DF/DS, with even
>>>>>>>>> better performance.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *From:* Mark Hamstra [mailto:mark@clearstorydata.com]
>>>>>>>>> *Sent:* Friday, November 13, 2015 11:23 AM
>>>>>>>>> *To:* Stephen Boesch
>>>>>>>>>
>>>>>>>>> *Cc:* dev@spark.apache.org
>>>>>>>>> *Subject:* Re: A proposal for Spark 2.0
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hmmm... to me, that seems like precisely the kind of thing that
>>>>>>>>> argues for retaining the RDD API but not as the first thing
>>>>>>>>> presented to new Spark developers: "Here's how to use groupBy with
>>>>>>>>> DataFrames.... Until the optimizer is more fully developed, that
>>>>>>>>> won't always get you the best performance that can be obtained. In
>>>>>>>>> these particular circumstances, ..., you may want to use the
>>>>>>>>> low-level RDD API while setting preservesPartitioning to true. Like
>>>>>>>>> this...."
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch <javadba@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> My understanding is that RDDs presently have more support for
>>>>>>>>> complete control of partitioning, which is a key consideration at
>>>>>>>>> scale. While partitioning control is still piecemeal in DF/DS, it
>>>>>>>>> would seem premature to make RDDs a second-tier approach to Spark
>>>>>>>>> dev.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> An example is the use of groupBy when we know that the source
>>>>>>>>> relation (/RDD) is already partitioned on the grouping expressions.
>>>>>>>>> AFAIK Spark SQL still does not allow that knowledge to be applied to
>>>>>>>>> the optimizer - so a full shuffle will be performed. However, in the
>>>>>>>>> native RDD API we can use preservesPartitioning=true.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2015-11-12 17:42 GMT-08:00 Mark Hamstra <mark@clearstorydata.com>:
>>>>>>>>>
>>>>>>>>> The place of the RDD API in 2.0 is also something I've been
>>>>>>>>> wondering about. I think it may be going too far to deprecate it,
>>>>>>>>> but changing emphasis is something that we might consider. The RDD
>>>>>>>>> API came well before DataFrames and DataSets, so programming guides,
>>>>>>>>> introductory how-to articles and the like have, to this point, also
>>>>>>>>> tended to emphasize RDDs -- or at least to deal with them early.
>>>>>>>>> What I'm thinking is that with 2.0 maybe we should overhaul all the
>>>>>>>>> documentation to de-emphasize and reposition RDDs. In this scheme,
>>>>>>>>> DataFrames and DataSets would be introduced and fully addressed
>>>>>>>>> before RDDs. They would be presented as the normal/default/standard
>>>>>>>>> way to do things in Spark. RDDs, in contrast, would be presented
>>>>>>>>> later as a kind of lower-level, closer-to-the-metal API that can be
>>>>>>>>> used in atypical, more specialized contexts where DataFrames or
>>>>>>>>> DataSets don't fully fit.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao <hao.cheng@intel.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> I am not sure what the best practice is for this specific problem,
>>>>>>>>> but it's really worth thinking about in 2.0, as it is a painful
>>>>>>>>> issue for lots of users.
>>>>>>>>>
>>>>>>>>> By the way, is it also an opportunity to deprecate the RDD API (or
>>>>>>>>> make it internal API only?)? Lots of its functionality overlaps
>>>>>>>>> with DataFrame or DataSet.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hao
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *From:* Kostas Sakellis [mailto:kostas@cloudera.com]
>>>>>>>>> *Sent:* Friday, November 13, 2015 5:27 AM
>>>>>>>>> *To:* Nicholas Chammas
>>>>>>>>> *Cc:* Ulanov, Alexander; Nan Zhu; witgo@qq.com;
>>>>>>>>> dev@spark.apache.org; Reynold Xin
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *Subject:* Re: A proposal for Spark 2.0
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I know we want to keep breaking changes to a minimum but I'm hoping
>>>>>>>>> that with Spark 2.0 we can also look at better classpath isolation
>>>>>>>>> with user programs. I propose we build on
>>>>>>>>> spark.{driver|executor}.userClassPathFirst, setting it true by
>>>>>>>>> default, and not allow any Spark transitive dependencies to leak
>>>>>>>>> into user code. For backwards compatibility we can have a whitelist
>>>>>>>>> if we want but it'd be good if we start requiring user apps to
>>>>>>>>> explicitly pull in all their dependencies. From what I can tell,
>>>>>>>>> Hadoop 3 is also moving in this direction.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Kostas
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <
>>>>>>>>> nicholas.chammas@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> With regards to Machine learning, it would be great to move useful
>>>>>>>>> features from MLlib to ML and deprecate the former. The current
>>>>>>>>> structure of two separate machine learning packages seems to be
>>>>>>>>> somewhat confusing.
>>>>>>>>>
>>>>>>>>> With regards to GraphX, it would be great to deprecate the use of
>>>>>>>>> RDD in GraphX and switch to DataFrame. This will allow GraphX to
>>>>>>>>> evolve with Tungsten.
>>>>>>>>>
>>>>>>>>> On that note of deprecating stuff, it might be good to deprecate
>>>>>>>>> some things in 2.0 without removing or replacing them immediately.
>>>>>>>>> That way 2.0 doesn't have to wait for everything that we want to
>>>>>>>>> deprecate to be replaced all at once.
>>>>>>>>>
>>>>>>>>> Nick
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander <
>>>>>>>>> alexander.ulanov@hpe.com> wrote:
>>>>>>>>>
>>>>>>>>> Parameter Server is a new feature and thus does not match the goal
>>>>>>>>> of 2.0, which is "to fix things that are broken in the current API
>>>>>>>>> and remove certain deprecated APIs". At the same time I would be
>>>>>>>>> happy to have that feature.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> With regards to Machine learning, it would be great to move useful
>>>>>>>>> features from MLlib to ML and deprecate the former. The current
>>>>>>>>> structure of two separate machine learning packages seems to be
>>>>>>>>> somewhat confusing.
>>>>>>>>>
>>>>>>>>> With regards to GraphX, it would be great to deprecate the use of
>>>>>>>>> RDD in GraphX and switch to DataFrame. This will allow GraphX to
>>>>>>>>> evolve with Tungsten.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Best regards, Alexander
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *From:* Nan Zhu [mailto:zhunanmcgill@gmail.com]
>>>>>>>>> *Sent:* Thursday, November 12, 2015 7:28 AM
>>>>>>>>> *To:* witgo@qq.com
>>>>>>>>> *Cc:* dev@spark.apache.org
>>>>>>>>> *Subject:* Re: A proposal for Spark 2.0
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Being specific to Parameter Server, I think the current agreement
>>>>>>>>> is that PS shall exist as a third-party library instead of a
>>>>>>>>> component of the core code base, isn't it?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> Nan Zhu
>>>>>>>>>
>>>>>>>>> http://codingcat.me
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thursday, November 12, 2015 at 9:49 AM, witgo@qq.com wrote:
>>>>>>>>>
>>>>>>>>> Does anyone have ideas about machine learning? Spark is missing some
>>>>>>>>> features for machine learning, for example, the parameter server.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Nov 12, 2015, at 05:32, Matei Zaharia <matei.zaharia@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I like the idea of popping out Tachyon to an optional component too
>>>>>>>>> to reduce the number of dependencies. In the future, it might even
>>>>>>>>> be useful to do this for Hadoop, but it requires too many API
>>>>>>>>> changes to be worth doing now.
>>>>>>>>>
>>>>>>>>> Regarding Scala 2.12, we should definitely support it eventually,
>>>>>>>>> but I don't think we need to block 2.0 on that because it can be
>>>>>>>>> added later too. Has anyone investigated what it would take to run
>>>>>>>>> on there? I imagine we don't need many code changes, just maybe some
>>>>>>>>> REPL stuff.
>>>>>>>>>
>>>>>>>>> Needless to say, but I'm all for the idea of making "major" releases
>>>>>>>>> as undisruptive as possible in the model Reynold proposed. Keeping
>>>>>>>>> everyone working with the same set of releases is super important.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Matei
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Nov 11, 2015, at 4:58 AM, Sean Owen <sowen@cloudera.com> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <rxin@databricks.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> to the Spark community. A major release should not be very different
>>>>>>>>> from a minor release and should not be gated based on new features.
>>>>>>>>> The main purpose of a major release is an opportunity to fix things
>>>>>>>>> that are broken in the current API and remove certain deprecated
>>>>>>>>> APIs (examples follow).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Agree with this stance. Generally, a major release might also be a
>>>>>>>>> time to replace some big old API or implementation with a new one,
>>>>>>>>> but I don't see obvious candidates.
>>>>>>>>>
>>>>>>>>> I wouldn't mind turning attention to 2.x sooner than later, unless
>>>>>>>>> there's a fairly good reason to continue adding features in 1.x to a
>>>>>>>>> 1.7 release. The scope as of 1.6 is already pretty darned big.
big.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 1. Scala 2.11 as the default build. We should still support Scala
>>>>>>>>> 2.10, but it has been end-of-life.
>>>>>>>>>
>>>>>>>>> By the time 2.x rolls around, 2.12 will be the main version, 2.11
>>>>>>>>> will be quite stable, and 2.10 will have been EOL for a while. I'd
>>>>>>>>> propose dropping 2.10. Otherwise it's supported for 2 more years.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2. Remove Hadoop 1 support.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
>>>>>>>>> sort of 'alpha' and 'beta' releases) and even <2.6.
>>>>>>>>>
>>>>>>>>> I'm sure we'll think of a number of other small things -- shading a
>>>>>>>>> bunch of stuff? reviewing and updating dependencies in light of
>>>>>>>>> simpler, more recent dependencies to support from Hadoop etc?
>>>>>>>>>
>>>>>>>>> Farming out Tachyon to a module? (I felt like someone proposed this?)
>>>>>>>>> Pop out any Docker stuff to another repo?
>>>>>>>>> Continue that same effort for EC2?
>>>>>>>>> Farming out some of the "external" integrations to another repo (?
>>>>>>>>> controversial)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> See also anything marked version "2+" in JIRA.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>>
>>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
