spark-dev mailing list archives

From Matei Zaharia <matei.zaha...@gmail.com>
Subject Re: Spark Improvement Proposals
Date Sun, 09 Oct 2016 20:40:52 GMT
Yup, but the example you gave is for alternatives about *user-facing behavior*, not implementation.
The current SIP doc describes "strategy" more as implementation strategy. I'm just saying
there are different possible goals for these types of docs.

BTW, PEPs and Scala SIPs focus primarily on user-facing behavior, but also require a reference
implementation. This is a bit different from what Cody had in mind, I think.

Matei

> On Oct 9, 2016, at 1:25 PM, Nicholas Chammas <nicholas.chammas@gmail.com> wrote:
> 
> Rejected strategies: I personally wouldn’t put this, because what’s the point of
voting to reject a strategy before you’ve really begun designing and implementing something?
What if you discover that the strategy is actually better when you start doing stuff?
> I would guess the point is to document alternatives that were discussed and rejected,
so that later on people can be pointed to that discussion and the devs don’t have to repeat
themselves unnecessarily every time someone comes along and asks “Why didn’t you do this
other thing?” That doesn’t mean a rejected proposal can’t later be revisited and the
SIP can’t be updated.
> 
> For reference from the Python community, PEP 492 <https://www.python.org/dev/peps/pep-0492/>,
a Python Enhancement Proposal for adding async and await syntax and “first-class” coroutines
to Python, has a section on rejected ideas <https://www.python.org/dev/peps/pep-0492/#why-async-def>
for the new syntax. It captures a summary of what the devs discussed, but it doesn’t mean
the PEP can’t be updated and a previously rejected proposal can’t be revived.
> 
> At least in the Python community, a PEP serves not just as a formal starting point for
a proposal (the “real” starting point is usually a discussion on python-ideas or python-dev),
but also as documentation of what was agreed on and a living “spec” of sorts. So PEPs
sometimes get updated years after they are approved when revisions are agreed upon. PEPs are
also intended for wide consumption, vs. bug tracker issues which the broader Python dev community
is not expected to follow closely.
> 
> Dunno if we want to follow a similar pattern for Spark, since the project’s needs are
different. But the Python community has used PEPs to help organize and steer development since
2000; there are plenty of examples there we can probably take inspiration from.
> 
> By the way, can we call these things something other than Spark Improvement Proposals?
The acronym, SIP, conflicts with Scala SIPs <http://docs.scala-lang.org/sips/index.html>.
Since the Scala and Spark communities have a lot of overlap, we don’t want, for example,
names like “SIP-10” to have an ambiguous meaning.
> 
> Nick
> 
> 
> On Sun, Oct 9, 2016 at 3:34 PM Matei Zaharia <matei.zaharia@gmail.com <mailto:matei.zaharia@gmail.com>>
wrote:
> Hi Cody,
> 
> I think this would be a lot more concrete if we had a more detailed template for SIPs.
Right now, it's not super clear what's in scope -- e.g. are they a way to solicit feedback
on the user-facing behavior or on the internals? "Goals" can cover both things. I've been
thinking of SIPs more as Product Requirements Docs (PRDs), which focus on *what* a code change
should do as opposed to how.
> 
> In particular, here are some things that you may or may not consider in scope for SIPs:
> 
> - Goals and non-goals: This is definitely in scope, and IMO should focus on user-visible
behavior (e.g. "system supports SQL window functions" or "system continues working if one
node fails"). BTW I wouldn't say "rejected goals" because some of them might become goals
later, so we're not definitively rejecting them.
> 
> - Public API: Probably should be included in most SIPs unless it's too large to fully
specify then (e.g. "let's add an ML library").
> 
> - Use cases: I usually find this very useful in PRDs to better communicate the goals.
> 
> - Internal architecture: This is usually *not* a thing users can easily comment on and
it sounds more like a design doc item. Of course it's important to show that the SIP is feasible
to implement. One exception, however, is that I think we'll have some SIPs primarily on internals
(e.g. if somebody wants to refactor Spark's query optimizer or something).
> 
> - Rejected strategies: I personally wouldn't put this, because what's the point of voting
to reject a strategy before you've really begun designing and implementing something? What
if you discover that the strategy is actually better when you start doing stuff?
> 
> At a super high level, it depends on whether you want the SIPs to be PRDs for getting
some quick feedback on the goals of a feature before it is designed, or something more like
full-fledged design docs (just a more visible design doc for bigger changes). I looked at
Kafka's KIPs, and they actually seem to be more like design docs. This can work too but it
does require more work from the proposer and it can lead to the same problems you mentioned
with people already having a design and implementation in mind.
> 
> Basically, the question is, are you trying to iterate faster on design by adding a step
for user feedback earlier? Or are you just trying to make design docs for key features more
visible (and their approval more formal)?
> 
> BTW note that in either case, I'd like to have a template for design docs too, which
should also include goals. I think that would've avoided some of the issues you brought up.
> 
> Matei
> 
>> On Oct 9, 2016, at 10:40 AM, Cody Koeninger <cody@koeninger.org <mailto:cody@koeninger.org>>
wrote:
>> 
>> Here's my specific proposal (meta-proposal?)
>> 
>> Spark Improvement Proposals (SIP)
>> 
>> 
>> 
>> Background:
>> 
>> The current problem is that design and implementation of large features are often
done in private, before soliciting user feedback.
>> 
>> When feedback is solicited, it is often as to detailed design specifics, not focused
on goals.
>> 
>> When implementation does take place after design, there is often disagreement as
to what goals are or are not in scope.
>> 
>> This results in commits that don't fully meet user needs.
>> 
>> 
>> 
>> Goals:
>> 
>> - Ensure user, contributor, and committer goals are clearly identified and agreed
upon, before implementation takes place.
>> 
>> - Ensure that a technically feasible strategy is chosen that is likely to meet the
goals.
>> 
>> 
>> 
>> Rejected Goals:
>> 
>> - SIPs are not for detailed design.  Design by committee doesn't work.
>> 
>> - SIPs are not for every change.  We don't need that much process.
>> 
>> 
>> 
>> Strategy:
>> 
>> My suggestion is outlined as a Spark Improvement Proposal process documented at
>> 
>> https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
<https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md>
>> Specifics of Jira manipulation are an implementation detail we can figure out.
>> 
>> I'm suggesting voting; the need here is for a _clear_ outcome.
>> 
>> 
>> 
>> Rejected Strategies:
>> 
>> Having someone who understands the problem implement it first works, but only if
significant iteration after user feedback is allowed.
>> 
>> Historically this has been problematic due to pressure to limit public api changes.
>> 
>> 
>> On Fri, Oct 7, 2016 at 5:16 PM, Reynold Xin <rxin@databricks.com <mailto:rxin@databricks.com>>
wrote:
>> Alright, looks like there is quite a bit of support. We should wait to hear from
more people too.
>> 
>> To push this forward, Cody and I will be working together in the next couple of weeks
to come up with a concrete, detailed proposal on what this entails, and then we can discuss
the specific proposal as well.
>> 
>> 
>> On Fri, Oct 7, 2016 at 2:29 PM, Cody Koeninger <cody@koeninger.org <mailto:cody@koeninger.org>>
wrote:
>> Yeah, in case it wasn't clear, I was talking about SIPs for major user-facing or
cross-cutting changes, not minor feature adds.
>> 
>> On Fri, Oct 7, 2016 at 3:58 PM, Stavros Kontopoulos <stavros.kontopoulos@lightbend.com
<mailto:stavros.kontopoulos@lightbend.com>> wrote:
>> +1 to the SIP label as long as it does not slow things down and it targets optimizing
efforts, coordination etc. For example, really small features (assuming they don't touch
public interfaces) or refactorings should not need to go through this process, and I hope it will
be kept this way. So a guideline doc should be provided, like in the KIP case.
>> 
>> IMHO so far aside from tagging things and linking them elsewhere simply having design
docs and prototype implementations in PRs is not something that has not worked so far. What
is really a pain in many projects out there is discontinuity in progress of PRs, missing features,
slow reviews which is understandable to some extent... it is not only about Spark but things
can be improved for sure for this project in particular as already stated.
>> 
>> On Fri, Oct 7, 2016 at 11:14 PM, Cody Koeninger <cody@koeninger.org <mailto:cody@koeninger.org>>
wrote:
>> +1 to adding an SIP label and linking it from the website.  I think it needs
>> 
>> - template that focuses it towards soliciting user goals / non goals
>> - clear resolution as to which strategy was chosen to pursue.  I'd
>> recommend a vote.
>> 
>> Matei asked me to clarify what I meant by changing interfaces, I think
>> it's directly relevant to the SIP idea so I'll clarify here, and split
>> a thread for the other discussion per Nicholas' request.
>> 
>> I meant changing public user interfaces.  I think the first design is
>> unlikely to be right, because it's done at a time when you have the
>> least information.  As a user, I find it considerably more frustrating
>> to be unable to use a tool to get my job done, than I do having to
>> make minor changes to my code in order to take advantage of features.
>> I've seen committers be seriously reluctant to allow changes to
>> @experimental code that are needed in order for it to really work
>> right.  You need to be able to iterate, and if people on both sides of
>> the fence aren't going to respect that some newer apis are subject to
>> change, then why even mark them as such?
>> 
>> Ideally a finished SIP should give me a checklist of things that an
>> implementation must do, and things that it doesn't need to do.
>> Contributors/committers should be seriously discouraged from putting
>> out a version 0.1 that doesn't have at least a prototype
>> implementation of all those things, especially if they're then going
>> to argue against interface changes necessary to get the rest of
>> the things done in the 0.2 version.
>> 
>> 
>> On Fri, Oct 7, 2016 at 2:18 PM, Reynold Xin <rxin@databricks.com <mailto:rxin@databricks.com>>
wrote:
>> > I like the lightweight proposal to add a SIP label.
>> >
>> > During Spark 2.0 development, Tom (Graves) and I suggested using wiki to
>> > track the list of major changes, but that never really materialized due to
>> > the overhead. Adding a SIP label on major JIRAs and then linking to them
>> > prominently on the Spark website makes a lot of sense.
>> >
>> >
>> > On Fri, Oct 7, 2016 at 10:50 AM, Matei Zaharia <matei.zaharia@gmail.com <mailto:matei.zaharia@gmail.com>>
>> > wrote:
>> >>
>> >> For the improvement proposals, I think one major point was to make them
>> >> really visible to users who are not contributors, so we should do more than
>> >> sending stuff to dev@. One very lightweight idea is to have a new type of
>> >> JIRA called a SIP and have a link to a filter that shows all such JIRAs from
>> >> http://spark.apache.org <http://spark.apache.org/>. I also like the idea of SIP and design doc
>> >> templates (in fact many projects have them).
>> >>
>> >> Matei
>> >>
>> >> On Oct 7, 2016, at 10:38 AM, Reynold Xin <rxin@databricks.com <mailto:rxin@databricks.com>>
wrote:
>> >>
>> >> I called Cody last night and talked about some of the topics in his email.
>> >> It became clear to me Cody genuinely cares about the project.
>> >>
>> >> Some of the frustrations come from the success of the project itself
>> >> becoming very "hot", and it is difficult to get clarity from people who
>> >> don't dedicate all their time to Spark. In fact, it is in some ways similar
>> >> to scaling an engineering team in a successful startup: old processes that
>> >> worked well might not work so well when it gets to a certain size, cultures
>> >> can get diluted, building culture vs building process, etc.
>> >>
>> >> I also really like to have a more visible process for larger changes,
>> >> especially major user-facing API changes. Historically we upload design docs
>> >> for major changes, but it is not always consistent and it is difficult to control the quality
>> >> of the docs, due to the volunteering nature of the organization.
>> >>
>> >> Some of the more concrete ideas we discussed focus on building a culture
>> >> to improve clarity:
>> >>
>> >> - Process: Large changes should have design docs posted on JIRA. One thing
>> >> Cody and I didn't discuss but an idea that just came to me is we should
>> >> create a design doc template for the project and ask everybody to follow.
>> >> The design doc template should also explicitly list goals and non-goals, to
>> >> make design docs more consistent.
>> >>
>> >> - Process: Email dev@ to solicit feedback. We have done this with some
>> >> changes, but again very inconsistently. Just posting something on JIRA isn't
>> >> sufficient, because there are simply too many JIRAs and the signal gets lost
>> >> in the noise. While this is generally impossible to enforce because we can't
>> >> force all volunteers to conform to a process (or they might not even be
>> >> aware of this), those who are more familiar with the project can help by
>> >> emailing dev@ when they see something that hasn't been.
>> >>
>> >> - Culture: The design doc author(s) should be open to feedback. A design
>> >> doc should serve as the base for discussion and is by no means the final
>> >> design. Of course, this does not mean the author has to accept every
>> >> piece of feedback. They should also be comfortable accepting / rejecting ideas on
>> >> technical grounds.
>> >>
>> >> - Process / Culture: For major ongoing projects, it can be useful to have
>> >> some monthly Google hangouts that are open to the world. I am actually not
>> >> sure how well this will work, because of the volunteering nature and we need
>> >> to adjust for timezones for people across the globe, but it seems worth
>> >> trying.
>> >>
>> >> - Culture: Contributors (including committers) should be more direct in
>> >> setting expectations, including whether they are working on a specific
>> >> issue, whether they will be working on a specific issue, and whether an
>> >> issue, PR, or JIRA should be rejected. Most people I know in this community
>> >> are nice and don't enjoy telling other people no, but it is often more
>> >> annoying to a contributor to not know anything than getting a no.
>> >>
>> >>
>> >> On Fri, Oct 7, 2016 at 10:03 AM, Matei Zaharia <matei.zaharia@gmail.com <mailto:matei.zaharia@gmail.com>>
>> >> wrote:
>> >>>
>> >>>
>> >>> Love the idea of a more visible "Spark Improvement Proposal" process that
>> >>> solicits user input on new APIs. For what it's worth, I don't think
>> >>> committers are trying to minimize their own work -- every committer cares
>> >>> about making the software useful for users. However, it is always hard to
>> >>> get user input and so it helps to have this kind of process. I've certainly
>> >>> looked at the *IPs a lot in other software I use just to see the biggest
>> >>> things on the roadmap.
>> >>>
>> >>> When you're talking about "changing interfaces", are you talking about
>> >>> public or internal APIs? I do think many people hate changing public APIs
>> >>> and I actually think that's for the best of the project. That's a technical
>> >>> debate, but basically, the worst thing when you're using a piece of software
>> >>> is that the developers constantly ask you to rewrite your app to update to a
>> >>> new version (and thus benefit from bug fixes, etc). Cue anyone who's used
>> >>> Protobuf, or Guava. The "let's get everyone to change their code this
>> >>> release" model works well within a single large company, but doesn't work
>> >>> well for a community, which is why nearly all *very* widely used programming
>> >>> interfaces (I'm talking things like Java standard library, Windows API, etc)
>> >>> almost *never* break backwards compatibility. All this is done within reason
>> >>> though, e.g. we do change things in major releases (2.x, 3.x, etc).
>> >>
>> >>
>> >>
>> >>
>> >
>> 
>> 
>> 
>> 
>> 
>> -- 
>> Stavros Kontopoulos
>> Senior Software Engineer
>> Lightbend, Inc.
>> p:  +30 6977967274
>> e: stavros.kontopoulos@lightbend.com <mailto:dave.martin@lightbend.com>
>> 
>> 
>> 
>> 
>> 
> 

