spark-dev mailing list archives

From Ilya Matiach
Subject RE: Feedback on MLlib roadmap process proposal
Date Wed, 25 Jan 2017 06:01:37 GMT
Thanks Sean, this is a really helpful overview, and it contains good guidance for new contributors.
My confusion was that the ML 2.2 roadmap critical features did not line up with the top
ML/MLLIB JIRAs by Votes or Watchers.
Your explanation that they do not have to, and that a more complex process decides which
changes make it into the next release, makes sense to me.
My only humble recommendation would be to clean up the top JIRAs by closing the ones which
have Spark packages for them (e.g. the NN one, which already has several packages, as you explained),
noting or somehow marking on some that they will not be resolved, and changing the component
on the ones not related to ML/MLLIB.
Also, I would love to do this if I had the permissions, but it would be great to change the
JIRAs that are marked as “in progress” but where the corresponding pull request was closed/cancelled,
for example  That JIRA is actually one of
the top ones by number of watchers (adding kernels like Radial Basis Function to SVM, and I
can imagine why it’s one of the top ones), and seeing it marked as in progress with a pull
request is somewhat confusing.  I’ve seen several other JIRAs similar to this one, where
the pull request was closed but the JIRA status was not updated – and if the pull request
was closed for a good reason, the corresponding JIRA should probably be closed as well.
Thank you, Ilya

From: Sean Owen
Sent: Tuesday, January 24, 2017 11:23 AM
To: Ilya Matiach
Subject: Re: Feedback on MLlib roadmap process proposal

On Tue, Jan 24, 2017 at 3:58 PM Ilya Matiach wrote:
Just a few questions with regards to the MLLIB process:

  1.  Is there a list of committers who can be or are shepherds, and what code they own?  I’ve
seen this page, but I’m not sure if it is up to date, and it doesn’t mention what code the committers own.
 It would be useful to know who owns ML or MLLIB.  From my limited personal experience this
seems to be Joseph K. Bradley, Yanbo Liang and Sean Owen.
There is no such list because there's no formal notion of ownership or access to subsets of
the project. Tracking an informal notion would be process mostly for its own sake, and probably
just go out of date. We sort of tried this with 'maintainers' and it didn't actually do anything.

I am not active much in ML, but will occasionally help commit simple changes. What you see
organically is pretty much what is, at any given time. People you see responding are the active
ones, and influencers, commit bit or no.

  2.  Based on both user votes and watchers, the top issue currently is “SPARK-5575: Artificial
neural networks for MLlib deep learning”.  However, it looks like it has been open for
almost 2 years and not a lot of progress is being made.  There seem to be other top issues
which aren’t getting addressed either on the pages mentioned in the roadmap: MLlib, sorted
by Votes or Watchers.  Is my perception incorrect, or is there a very good reason for not
addressing the top issues voted for by the community?  If there is a good reason, is there
a way to filter such JIRAs out from the sorted lists, to know which JIRAs really should be
taken/worked on?
JIRA votes and watchers don't mean anything, formally. This isn't a product company where
one group might give another group a list of top priorities to work on. There's a general
statement about this under "Code Review Criteria". In practice, it's a soft process of convincing
other people that change X does more good than harm, is worth taking the burden of supporting, matters
to users, etc. I ignore the 80% of issues that don't seem to fit these criteria, and choose to
help with the 20% that do, which are usually simple and/or important bug fixes.

ANNs? That's a tangent, but my snap reactions are:
It's something Everybody wants Somebody Else to create, which may explain the votes vs activity?
There is one basic ANN implementation in Spark actually.
There are others outside Spark, so may be something people get elsewhere like dl4j or BigDL,
or strapping TF to Spark in various ways.
DL is also not an obviously-great fit for the data-parallel computation model here.
It's not a goal to implement everything in Spark. It could be a good idea, but, no need to
tether it to the core project, to the exclusion of "unblessed" third-party packages.

  3.  Also, this might be a newbie question, but for new contributors to Spark, is there a
process to convince a committer to be assigned to a JIRA that we are working on?  It would
be useful if there were a clear threshold for whether a committer can decline to work on a JIRA
ahead of time, so contributors won’t waste time working on issues that aren’t important
to Spark and can focus on making progress on the issues that the Spark committers would like us
to fix.

No, there's no concept of being tasked to work on something by someone else here. I can't
imagine we could establish a clear objective threshold for such a subjective thing.

It's not a satisfying answer but it is the most realistic one. All of these OSS projects work
on soft power, persuasion and cooperation. I think the good news is that all the intuitive
ways to gain soft power do work: give time to others' problems if you want time on your own,
help review, make thoughtful careful changes, etc.

My general guidance is: don't bother doing significant feature work unless you have some clear
buy-in from someone who can commit.

I completely agree that issues should be closed more aggressively for the reason you give.
On the flip side, this often ruffles feathers. We are still overrun with issues, but the culture
has gotten a lot better about honestly rejecting lots of inbound stuff quickly.
