spark-dev mailing list archives

From Nick Pentreath <>
Subject Re: Feedback on MLlib roadmap process proposal
Date Fri, 24 Feb 2017 08:28:16 GMT
FYI I've started going through a few of the top Watched JIRAs and tried to
identify those that are obviously stale and can probably be closed, to try
to clean things up a bit.

On Thu, 23 Feb 2017 at 21:38 Tim Hunter <> wrote:

> As Sean wrote very nicely above, the changes made to Spark are decided in
> an organic fashion based on the interests and motivations of the committers
> and contributors. The case of deep learning is a good example. There is a
> lot of interest, and the core algorithms could be implemented without too
> much problem in a few thousand lines of Scala code. However, such a
> simple implementation would be one to two orders of magnitude slower
> than what you would get from the popular frameworks out there.
> At this point, there are probably more man-hours invested in TensorFlow
> (as an example) than in MLlib, so I think we need to be realistic about
> what we can expect to achieve inside Spark. Unlike BLAS for linear algebra,
> there is no agreed-upon interface for deep learning, and each of the XOnSpark
> flavors explores a slightly different design. It will be interesting to see
> what works well in practice. In the meantime, though, there are plenty of
> things that we could do to help developers of other libraries to have a
> great experience with Spark. Matei alluded to that in his Spark Summit
> keynote when he mentioned better integration with low-level libraries.
> Tim
> On Thu, Feb 23, 2017 at 5:32 AM, Nick Pentreath <>
> wrote:
> Sorry for being late to the discussion. I think Joseph, Sean and others
> have covered the issues well.
> Overall I like the proposed cleaned up roadmap & process (thanks Joseph!).
> As for the actual critical roadmap items mentioned on SPARK-18813, I think
> it makes sense and will comment a bit further on that JIRA.
> I would like to encourage votes & watching for issues to give a sense of
> what the community wants (I guess Vote is more explicit yet passive, while
> actually Watching an issue is more informative as it may indicate a real
> use case dependent on the issue?!).
> I think if used well this is valuable information for contributors. Of
> course not everything on that list can get done. But if I look through the
> top votes or watch list, while not all of those are likely to go in, a
> great many of the issues are fairly non-contentious in terms of being good
> additions to the project.
> Things like these are good examples IMO (I have just sampled a few of them;
> the list is not exhaustive):
> - sample weights for RF / DT
> - multi-model and/or parallel model selection
> - make sharedParams public?
> - multi-column support for various transformers
> - incremental model training
> - tree algorithm enhancements
> Now, whether these can be prioritised in terms of bandwidth available to
> reviewers and committers is a totally different thing. But as Sean mentions
> there is some process there for trying to find the balance of the issue
> being a "good thing to add", a shepherd with bandwidth & interest in the
> issue to review, and the maintenance burden imposed.
> Let's take Deep Learning / NN for example. Here's a good example of
> something that has a lot of votes/watchers and as Sean mentions it is
> something that "everyone wants someone else to implement". In this case,
> much of the interest may in fact be "stale" - 2 years ago it would have
> been very interesting to have a strong DL impl in Spark. Now, because there
> are a plethora of very good DL libraries out there, how many of those Votes
> would be "deleted"? Granted, few are well integrated with Spark, but that can
> change and is changing (DL4J, BigDL, the "XonSpark" flavours etc.).
> So this is something that I dare say will not be in Spark any time in the
> foreseeable future or perhaps ever given the current status. Perhaps it's
> worth seriously thinking about just closing these kinds of issues?
> On Fri, 27 Jan 2017 at 05:53 Joseph Bradley <> wrote:
> Sean has given a great explanation.  A few more comments:
> Roadmap: I have been creating roadmap JIRAs, but the goal really is to
> have all committers working on MLlib help to set that roadmap, based on
> either their knowledge of current maintenance/internal needs of the project
> or the feedback given from the rest of the community.
> @Committers - I see people actively shepherding PRs for MLlib, but I don't
> see many major initiatives linked to the roadmap.  If there are ones large
> enough to merit adding to the roadmap, please do.
> In general, there are many process improvements we could make.  A few in
> my mind are:
> * Visibility: Let the community know what committers are focusing on.
> This was the primary purpose of the "MLlib roadmap proposal."
> * Community initiatives: This is currently very organic.  Some of the
> organic process could be improved, such as encouraging Votes/Watchers
> (though I agree with Sean about these being one-sided metrics).  Cody's SIP
> work is a great step towards adding more clarity and structure for major
> initiatives.
> * JIRA hygiene: Always a challenge, and always requires some manual
> prodding.  But it's great to push for efforts on this.
> On Wed, Jan 25, 2017 at 3:59 AM, Sean Owen <> wrote:
> On Wed, Jan 25, 2017 at 6:01 AM Ilya Matiach <> wrote:
> My confusion was that the ML 2.2 roadmap critical features did not line up
> with the top ML/MLLIB JIRAs by Votes or Watchers.
> Your explanation that they do not have to and there is a more complex
> process to choosing the changes that will make it into the next release
> makes sense to me.
> For Spark ML, Joseph is the de facto leader and does publish a tentative
> roadmap. (We could also use JIRA mechanisms for this but any scheme is
> better than none.) Yes, not based on Votes -- nothing here is. Votes are
> a noisy signal because they usually measure: what would you like done if
> you didn't have to do it and there were no downsides for you?
> My only humble recommendation would be to clean up the top JIRAs by closing
> the ones which have Spark packages for them (eg the NN one which already
> has several packages as you explained), noting or somehow marking on some
> that they will not be resolved, and changing the component on the ones not
> related to ML/MLLIB (eg
> ).
> We do that. It occasionally generates protests, so, I find myself erring
> on the side of ignoring. You can comment on any JIRA you think should be
> closed. That's helpful.
> That particular JIRA seems potentially legitimate. I wouldn't close it. It
> also won't get fixed until someone proposes a resolution. I'd strongly
> encourage people saying "I have this problem too" to try to fix it. I tend
> to ignore these otherwise, myself, in favor of reviewing ones where someone
> has gone to the trouble of proposing a working fix.
> Also, I would love to do this if I had the permissions, but it would be
> great to change the JIRAs that are marked as “in progress” but where the
> corresponding pull request was closed/cancelled, for example
>  That JIRA is
> Yes, flag these. I or others can close them if appropriate. Anyone who
> consistently does this well, we could give JIRA permissions to.
> Opening a PR automatically makes it "In Progress" but there's no
> complementary process to un-mark it. You can ignore the Open / In Progress
> distinction really.
> This one is interesting because it does seem like a plausible feature to
> add. The original PR was abandoned by the author and nobody else submitted
> one -- despite the Votes. I hesitate to signal that no PRs would be
> considered, but, doesn't seem like it's in demand enough for someone to
> work on?
> I think one of my messages is that, de facto, here, like in many Apache
> projects, committers do not take requests. They pursue the work they
> believe needs doing, and shepherd work initiated by others (a clear bug
> report, a PR) to a resolution. Things get done by doing them, or by
> building influence by doing other things the project needs doing. It isn't
> a mechanical, objective process, and can't be. But it does work in a
> recognizable way.
> --
> Joseph Bradley
> Software Engineer - Machine Learning
> Databricks, Inc.