From: Reynold Xin
Date: Mon, 23 Nov 2015 22:51:50 -0800
Subject: Re: A proposal for Spark 2.0
To: Mark Hamstra
Cc: Kostas Sakellis, dev@spark.apache.org

I actually think the next one (after 1.6) should be Spark 2.0. The reason is that I already know we have to break some part of the DataFrame/Dataset API as part of the Dataset design (e.g. DataFrame.map should return Dataset rather than RDD). In that case, I'd rather break this sooner (in one release) than later (in two releases), so the damage is smaller.

I don't think whether we call Dataset/DataFrame experimental or not matters too much for 2.0. We can still call Dataset experimental in 2.0 and then mark it as stable in 2.1. Despite being "experimental", there have been no breaking changes to DataFrame from 1.3 to 1.6.
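
For concreteness, here is a rough sketch of the signature change being described; the types are only illustrative of the direction, not a committed 2.0 API:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, Row}

// Today (1.x): DataFrame.map leaves the optimized plan and returns an RDD.
def firstColumn1x(df: DataFrame): RDD[String] =
  df.map((row: Row) => row.getString(0))

// Sketch of the proposed direction: map stays in the typed Dataset world.
// def firstColumn2x(df: DataFrame): Dataset[String] =
//   df.map((row: Row) => row.getString(0))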



On Wed, Nov 18, 2015 at 3:43 PM, Mark Hamstra <mark@clearstorydata.com> wrote:
Ah, got it; by "stabilize" you meant changing the API, not just bug fixing. We're on the same page now.

On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis <kostas@cloudera.com> wrote:
A 1.6.x release will only fix bugs - we typically don't change APIs in z releases. The Dataset API is experimental and so we might be changing the APIs before we declare it stable. This is why I think it is important to first stabilize the Dataset API with a Spark 1.7 release before moving to Spark 2.0. This will benefit users that would like to use the new Dataset APIs but can't move to Spark 2.0 because of the backwards-incompatible changes, like removal of deprecated APIs, Scala 2.11, etc.

Kostas


On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra <mark@clearstorydata.com> wrote:
Why does stabilization of those two features require a 1.7 release instead of 1.6.1?

On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis <kostas@cloudera.com> wrote:
We have veered off the topic of Spark 2.0 a little bit here - yes, we can talk about RDD vs. DS/DF more, but let's refocus on Spark 2.0. I'd like to propose we have one more 1.x release after Spark 1.6. This will allow us to stabilize a few of the new features that were added in 1.6:

1) the experimental Datasets API
2) the new unified memory manager.

I understand our goal for Spark 2.0 is to offer an easy transition, but there will be users that won't be able to seamlessly upgrade given what we have discussed as in scope for 2.0. For these users, having a 1.x release with these new features/APIs stabilized will be very beneficial. This might make Spark 1.7 a lighter release, but that is not necessarily a bad thing.
Any thoughts on this timeline?

Kostas Sakellis



On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao <hao.cheng@intel.com> wrote:

Agree, more features/APIs/optimizations need to be added in DF/DS.

I mean, we need to think about what kind of RDD APIs we have to provide to developers; maybe the fundamental API is enough, like the ShuffledRDD etc. But PairRDDFunctions is probably not in this category, as we can do the same thing easily with DF/DS, with even better performance.
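
As a rough illustration of that overlap (assuming a SparkContext sc and a 1.6 SQLContext sqlContext are in scope; the input file is hypothetical), the same aggregation written both ways:

// PairRDDFunctions style: aggregate by key on an RDD.
val countsRdd = sc.textFile("events.txt")
  .map(line => (line, 1))
  .reduceByKey(_ + _)

// DataFrame style: same result, expressed relationally
// (1.6's sqlContext.read.text yields a single "value" column).
val countsDf = sqlContext.read.text("events.txt")
  .groupBy("value")
  .count()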

From: Mark Hamstra [mailto:mark@clearstorydata.com]
Sent: Friday, November 13, 2015 11:23 AM
To: Stephen Boesch
Cc: dev@spark.apache.org
Subject: Re: A proposal for Spark 2.0

Hmmm... to me, that seems like precisely the kind of thing that argues for retaining the RDD API but not as the first thing presented to new Spark developers: "Here's how to use groupBy with DataFrames.... Until the optimizer is more fully developed, that won't always get you the best performance that can be obtained. In these particular circumstances, ..., you may want to use the low-level RDD API while setting preservesPartitioning to true. Like this...."

On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch <javadba@gmail.com> wrote:

My understanding is that the RDDs presently have more support for complete control of partitioning, which is a key consideration at scale. While partitioning control is still piecemeal in DF/DS, it would seem premature to make RDDs a second-tier approach to Spark dev.

An example is the use of groupBy when we know that the source relation (/RDD) is already partitioned on the grouping expressions. AFAIK Spark SQL still does not allow that knowledge to be applied to the optimizer, so a full shuffle will be performed. However, in the native RDD API we can use preservesPartitioning=true.
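
A minimal sketch of that pattern, assuming a SparkContext sc; the data and partition count are made up:

import org.apache.spark.HashPartitioner

// A pair RDD already partitioned on the key we will aggregate by.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  .partitionBy(new HashPartitioner(8))

// preservesPartitioning = true declares that the keys (and thus the partitioner)
// are unchanged, so the reduceByKey below reuses the existing partitioning
// instead of triggering a full shuffle.
val scaled = pairs.mapPartitions(
  iter => iter.map { case (k, v) => (k, v * 10) },
  preservesPartitioning = true)

val totals = scaled.reduceByKey(_ + _)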

2015-11-12 17:42 GMT-08:00 Mark Hamstra <mark@clearstorydata.com>:

The place of the RDD API in 2.0 is also something I've been wondering about. I think it may be going too far to deprecate it, but changing emphasis is something that we might consider. The RDD API came well before DataFrames and DataSets, so programming guides, introductory how-to articles and the like have, to this point, also tended to emphasize RDDs -- or at least to deal with them early. What I'm thinking is that with 2.0 maybe we should overhaul all the documentation to de-emphasize and reposition RDDs. In this scheme, DataFrames and DataSets would be introduced and fully addressed before RDDs. They would be presented as the normal/default/standard way to do things in Spark. RDDs, in contrast, would be presented later as a kind of lower-level, closer-to-the-metal API that can be used in atypical, more specialized contexts where DataFrames or DataSets don't fully fit.

On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao <hao.cheng@intel.com> wrote:

I am not sure what the best practice for this specific problem is, but it's really worth thinking about in 2.0, as it is a painful issue for lots of users.

By the way, is it also an opportunity to deprecate the RDD API (or internal API only?)? As lots of its functionality overlaps with DataFrame or DataSet.

Hao

From: Kostas Sakellis [mailto:kostas@cloudera.com]
Sent: Friday, November 13, 2015 5:27 AM
To: Nicholas Chammas
Cc: Ulanov, Alexander; Nan Zhu; witgo@qq.com; dev@spark.apache.org; Reynold Xin
Subject: Re: A proposal for Spark 2.0

I know we want to keep breaking changes to a minimum, but I'm hoping that with Spark 2.0 we can also look at better classpath isolation with user programs. I propose we build on spark.{driver|executor}.userClassPathFirst, setting it true by default, and not allowing any Spark transitive dependencies to leak into user code. For backwards compatibility we can have a whitelist if we want, but it'd be good if we start requiring user apps to explicitly pull in all their dependencies. From what I can tell, Hadoop 3 is also moving in this direction.
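
For reference, a minimal sketch of opting in to that behavior today; these flags already exist in 1.x (marked experimental) and default to false, and the proposal above is about flipping that default:

import org.apache.spark.SparkConf

// Prefer user-provided jars over Spark's own classpath on both the
// driver and the executors.
val conf = new SparkConf()
  .set("spark.driver.userClassPathFirst", "true")
  .set("spark.executor.userClassPathFirst", "true")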


Kostas

On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <nicholas.chammas@gmail.com> wrote:

With regards to machine learning, it would be great to move useful features from MLlib to ML and deprecate the former. The current structure of two separate machine learning packages seems somewhat confusing.

With regards to GraphX, it would be great to deprecate the use of RDD in GraphX and switch to DataFrame. This will allow GraphX to evolve with Tungsten.

On that note of deprecating stuff, it might be good to deprecate some things in 2.0 without removing or replacing them immediately. That way 2.0 doesn't have to wait for everything that we want to deprecate to be replaced all at once.

Nick


On Thu, Nov 12, 2015 at 12:45 PM, Ulanov, Alexander <alexander.ulanov@hpe.com> wrote:

Parameter Server is a new feature and thus does not match the goal of 2.0, which is "to fix things that are broken in the current API and remove certain deprecated APIs". At the same time, I would be happy to have that feature.

With regards to machine learning, it would be great to move useful features from MLlib to ML and deprecate the former. The current structure of two separate machine learning packages seems somewhat confusing.

With regards to GraphX, it would be great to deprecate the use of RDD in GraphX and switch to DataFrame. This will allow GraphX to evolve with Tungsten.
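
For readers less familiar with the split, these are the two parallel namespaces being discussed (logistic regression shown as an arbitrary example):

// RDD-based "MLlib" API:
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

// DataFrame-based "ML" (Pipelines) API:
import org.apache.spark.ml.classification.LogisticRegression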

Best regards, Alexander

From: Nan Zhu [mailto:zhunanmcgill@gmail.com]
Sent: Thursday, November 12, 2015 7:28 AM
To: witgo@qq.com
Cc: dev@spark.apache.org
Subject: Re: A proposal for Spark 2.0

Being specific to Parameter Server, I think the current agreement is that PS shall exist as a third-party library instead of a component of the core code base, isn't it?

Best,

--
Nan Zhu
http://codingcat.me

On Thursday, November 12, 2015 at 9:49 AM, witgo@qq.com wrote:

Who has ideas about machine learning? Spark is missing some features for machine learning, for example, the parameter server.

On Nov 12, 2015, at 05:32, Matei Zaharia <matei.zaharia@gmail.com> wrote:

I like the idea of popping out Tachyon to an optional component too to reduce the number of dependencies. In the future, it might even be useful to do this for Hadoop, but it requires too many API changes to be worth doing now.

Regarding Scala 2.12, we should definitely support it eventually, but I don't think we need to block 2.0 on that because it can be added later too. Has anyone investigated what it would take to run on there? I imagine we don't need many code changes, just maybe some REPL stuff.

Needless to say, but I'm all for the idea of making "major" releases as undisruptive as possible in the model Reynold proposed. Keeping everyone working with the same set of releases is super important.

Matei

On Nov 11, 2015, at 4:58 AM, Sean Owen <sowen@cloudera.com> wrote:

On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <rxin@databricks.com> wrote:

> to the Spark community. A major release should not be very different from a minor release and should not be gated based on new features. The main purpose of a major release is an opportunity to fix things that are broken in the current API and remove certain deprecated APIs (examples follow).

Agree with this stance. Generally, a major release might also be a time to replace some big old API or implementation with a new one, but I don't see obvious candidates.

I wouldn't mind turning attention to 2.x sooner than later, unless there's a fairly good reason to continue adding features in 1.x to a 1.7 release. The scope as of 1.6 is already pretty darned big.

> 1. Scala 2.11 as the default build. We should still support Scala 2.10, but it has been end-of-life.

By the time 2.x rolls around, 2.12 will be the main version, 2.11 will be quite stable, and 2.10 will have been EOL for a while. I'd propose dropping 2.10. Otherwise it's supported for 2 more years.

> 2. Remove Hadoop 1 support.

I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were sort of 'alpha' and 'beta' releases) and even <2.6.

I'm sure we'll think of a number of other small things -- shading a bunch of stuff? reviewing and updating dependencies in light of simpler, more recent dependencies to support from Hadoop etc?

Farming out Tachyon to a module? (I felt like someone proposed this?)
Pop out any Docker stuff to another repo?
Continue that same effort for EC2?
Farming out some of the "external" integrations to another repo (? controversial)

See also anything marked version "2+" in JIRA.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org