From: Nick Pentreath
Date: Fri, 24 Feb 2017 08:28:16 +0000
Subject: Re: Feedback on MLlib roadmap process proposal
To: "dev@spark.apache.org"

FYI I've started going through a few of the top Watched JIRAs and tried to
identify those that are obviously stale and can probably be closed, to try
to clean things up a bit.

On Thu, 23 Feb 2017 at 21:38 Tim Hunter wrote:

> As Sean wrote very nicely above, the changes made to Spark are decided in
> an organic fashion based on the interests and motivations of the committers
> and contributors. The case of deep learning is a good example. There is a
> lot of interest, and the core algorithms could be implemented without too
> much problem in a few thousand lines of Scala code. However, the
> performance of such a simple implementation would be one to two orders of
> magnitude slower than what you would get from the popular frameworks out
> there.
>
> At this point, there are probably more man-hours invested in TensorFlow
> (as an example) than in MLlib, so I think we need to be realistic about
> what we can expect to achieve inside Spark. Unlike BLAS for linear algebra,
> there is no agreed-upon interface for deep learning, and each of the
> XOnSpark flavors explores a slightly different design. It will be
> interesting to see what works well in practice. In the meantime, though,
> there are plenty of things that we could do to help developers of other
> libraries to have a great experience with Spark. Matei alluded to that in
> his Spark Summit keynote when he mentioned better integration with
> low-level libraries.
>
> Tim
>
>
> On Thu, Feb 23, 2017 at 5:32 AM, Nick Pentreath
> wrote:
>
> Sorry for being late to the discussion. I think Joseph, Sean and others
> have covered the issues well.
>
> Overall I like the proposed cleaned up roadmap & process (thanks Joseph!).
> As for the actual critical roadmap items mentioned on SPARK-18813, I think
> it makes sense and will comment a bit further on that JIRA.
>
> I would like to encourage votes & watching for issues to give a sense of
> what the community wants (I guess Vote is more explicit yet passive, while
> actually Watching an issue is more informative as it may indicate a real
> use case dependent on the issue?!).
>
> I think if used well this is valuable information for contributors. Of
> course not everything on that list can get done. But if I look through the
> top votes or watch list, while not all of those are likely to go in, a
> great many of the issues are fairly non-contentious in terms of being good
> additions to the project.
>
> Things like these are good examples IMO (I just sample a few of them, not
> exhaustive):
> - sample weights for RF / DT
> - multi-model and/or parallel model selection
> - make sharedParams public?
> - multi-column support for various transformers
> - incremental model training
> - tree algorithm enhancements
>
> Now, whether these can be prioritised in terms of bandwidth available to
> reviewers and committers is a totally different thing. But as Sean mentions
> there is some process there for trying to find the balance of the issue
> being a "good thing to add", a shepherd with bandwidth & interest in the
> issue to review, and the maintenance burden imposed.
>
> Let's take Deep Learning / NN for example. Here's a good example of
> something that has a lot of votes/watchers and as Sean mentions it is
> something that "everyone wants someone else to implement". In this case,
> much of the interest may in fact be "stale" - 2 years ago it would have
> been very interesting to have a strong DL impl in Spark. Now, because there
> are a plethora of very good DL libraries out there, how many of those Votes
> would be "deleted"?
> Granted, few are well integrated with Spark, but that can change
> and is changing (DL4J, BigDL, the "XonSpark" flavours etc).
>
> So this is something that I dare say will not be in Spark any time in the
> foreseeable future or perhaps ever given the current status. Perhaps it's
> worth seriously thinking about just closing these kinds of issues?
>
>
>
> On Fri, 27 Jan 2017 at 05:53 Joseph Bradley wrote:
>
> Sean has given a great explanation. A few more comments:
>
> Roadmap: I have been creating roadmap JIRAs, but the goal really is to
> have all committers working on MLlib help to set that roadmap, based on
> either their knowledge of current maintenance/internal needs of the project
> or the feedback given from the rest of the community.
> @Committers - I see people actively shepherding PRs for MLlib, but I don't
> see many major initiatives linked to the roadmap. If there are ones large
> enough to merit adding to the roadmap, please do.
>
> In general, there are many process improvements we could make. A few in
> my mind are:
> * Visibility: Let the community know what committers are focusing on.
> This was the primary purpose of the "MLlib roadmap proposal."
> * Community initiatives: This is currently very organic. Some of the
> organic process could be improved, such as encouraging Votes/Watchers
> (though I agree with Sean about these being one-sided metrics). Cody's SIP
> work is a great step towards adding more clarity and structure for major
> initiatives.
> * JIRA hygiene: Always a challenge, and always requires some manual
> prodding. But it's great to push for efforts on this.
>
>
> On Wed, Jan 25, 2017 at 3:59 AM, Sean Owen wrote:
>
> On Wed, Jan 25, 2017 at 6:01 AM Ilya Matiach wrote:
>
> My confusion was that the ML 2.2 roadmap critical features (
> https://issues.apache.org/jira/browse/SPARK-18813) did not line up with
> the top ML/MLLIB JIRAs by Votes or Watchers.
>
> Your explanation that they do not have to and there is a more complex
> process to choosing the changes that will make it into the next release
> makes sense to me.
>
>
> For Spark ML, Joseph is the de facto leader and does publish a tentative
> roadmap. (We could also use JIRA mechanisms for this but any scheme is
> better than none.) Yes, not based on Votes -- nothing here is. Votes are
> a noisy signal because they usually measure: what would you like done if
> you didn't have to do it and there were no downsides for you?
>
>
>
> My only humble recommendation would be to clean up the top JIRAs by closing
> the ones which have spark packages for them (eg the NN one which already
> has several packages as you explained), noting or somehow marking on some
> that they will not be resolved, and changing the component on the ones not
> related to ML/MLLIB (eg https://issues.apache.org/jira/browse/SPARK-12965
> ).
>
>
> We do that. It occasionally generates protests, so, I find myself erring
> on the side of ignoring. You can comment on any JIRA you think should be
> closed. That's helpful.
>
> That particular JIRA seems potentially legitimate. I wouldn't close it. It
> also won't get fixed until someone proposes a resolution. I'd strongly
> encourage people saying "I have this problem too" to try to fix it. I tend
> to ignore these otherwise, myself, in favor of reviewing ones where someone
> has gone to the trouble of proposing a working fix.
>
>
>
> Also, I would love to do this if I had the permissions, but it would be
> great to change the JIRAs that are marked as “in progress” but where the
> corresponding pull request was closed/cancelled, for example
> https://issues.apache.org/jira/browse/SPARK-4638. That JIRA is
>
>
> Yes, flag these. I or others can close them if appropriate. Anyone who
> consistently does this well, we could give JIRA permissions to.
>
> Opening a PR automatically makes it "In Progress" but there's no
> complementary process to un-mark it. You can ignore the Open / In Progress
> distinction really.
>
> This one is interesting because it does seem like a plausible feature to
> add. The original PR was abandoned by the author and nobody else submitted
> one -- despite the Votes. I hesitate to signal that no PRs would be
> considered, but, doesn't seem like it's in demand enough for someone to
> work on?
>
>
> I think one of my messages is that, de facto, here, like in many Apache
> projects, committers do not take requests. They pursue the work they
> believe needs doing, and shepherd work initiated by others (a clear bug
> report, a PR) to a resolution. Things get done by doing them, or by
> building influence by doing other things the project needs doing. It isn't
> a mechanical, objective process, and can't be. But it does work in a
> recognizable way.
>
>
>
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>
> [image: http://databricks.com]