spark-issues mailing list archives

From "Yanbo Liang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-18813) MLlib 2.2 Roadmap
Date Mon, 12 Dec 2016 08:40:58 GMT

     [ https://issues.apache.org/jira/browse/SPARK-18813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yanbo Liang updated SPARK-18813:
--------------------------------
    Description: 
*PROPOSAL: This is a proposal for the 2.2 roadmap process for MLlib.*
The roadmap process described below has been significantly updated since the 2.1 roadmap [SPARK-15581].
 Please refer to [SPARK-15581] for more discussion of the basis for this proposal, and comment
on this JIRA if you have suggestions for improvements.

h1. Roadmap process

This roadmap is a master list of the MLlib improvements we are working on during this release.
 This includes ML-related changes in PySpark and SparkR.

*What is planned for the next release?*
* This roadmap lists issues which at least one Committer has prioritized.  See details below
in "Instructions for committers."
* This roadmap only lists larger or more critical issues.

*How can contributors influence this roadmap?*
* If you believe an issue should be in this roadmap, please discuss the issue on JIRA and/or
the dev mailing list.  Make sure to ping Committers since at least one must agree to shepherd
the issue.
* For general discussions, use this JIRA or the dev mailing list.  For specific issues, please
comment on those issues or the mailing list.

h2. Target Version and Priority

This section describes the meaning of Target Version and Priority.  _These meanings have been
updated in this proposal for the 2.2 process._

|| Category || Target Version || Priority || Shepherd || Put on roadmap? || In next release? ||
| 1 | next release | Blocker | *must* | *must* | *must* |
| 2 | next release | Critical | *must* | yes, unless small | *best effort* |
| 3 | next release | Major | *must* | optional | *best effort* |
| 4 | next release | Minor | optional | no | maybe |
| 5 | next release | Trivial | optional | no | maybe |
| 6 | (empty) | (any) | yes | no | maybe |
| 7 | (empty) | (any) | no | no | maybe |

The *Category* in the table above has the following meaning:

1. A committer has promised to see this issue to completion for the next release.  Contributions
*will* receive attention.
2-3. A committer has promised to see this issue to completion for the next release.  Contributions
*will* receive attention.  The issue may slip to the next release if development is slower
than expected.
4-5. A committer has promised interest in this issue.  Contributions *will* receive attention.
 The issue may slip to another release.
6. A committer has promised interest in this issue and should respond, but no promises are
made about priorities or releases.
7. This issue is open for discussion, but it needs a committer to promise interest to proceed.

h1. Instructions

h2. For contributors

Getting started
* Please read http://spark.apache.org/contributing.html carefully. Code style, documentation,
and unit tests are important.
* If you are a first-time contributor, please always start with a small [starter task|https://issues.apache.org/jira/issues/?filter=12333209]
rather than a larger feature.

Coordinating on JIRA
* Never work silently. Let everyone know on the corresponding JIRA page when you start work.
This is to avoid duplicate work. For small patches, you do not need to get the JIRA assigned
to you to begin work.
* For medium/large features or features with dependencies, please get assigned first before
coding and keep the ETA updated on the JIRA. If there is no activity on the JIRA page for
a certain amount of time, the JIRA should be released for other contributors.
* Do not claim many (more than 3) JIRAs at the same time. Try to finish them one after another.
* Do not set these fields: Target Version, Fix Version, or Shepherd.  Only Committers should
set those.

Writing and reviewing PRs
* Remember to add the `@Since("VERSION")` annotation to new public APIs.
* *Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review greatly helps
to improve others' code as well as yours.*
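As a hedged sketch of the annotation guideline above: the annotation is Spark's {{org.apache.spark.annotation.Since}}, but the class and method shown here are hypothetical, invented only to illustrate placement.

```scala
import org.apache.spark.annotation.Since

// Hypothetical new public API: each public class and method added in a
// release carries the version in which it first appeared.
@Since("2.2.0")
class ExampleFeatureTransformer {

  @Since("2.2.0")
  def transformColumn(name: String): String = {
    // implementation elided
    name
  }
}
```

Reviewers can then flag any new public member that lacks the annotation before merging.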

h2. For Committers

Adding to this roadmap
* You can update the roadmap by (a) adding issues to this list and (b) setting Target Versions.
 Only Committers may make these changes.
* *If you add an issue to this roadmap or set a Target Version, you _must_ assign yourself
or another Committer as Shepherd.*
* This list should be actively managed during the release.
* If you target a significant item for the next release, please list the item on this roadmap.
* If you commit to shepherding a new public API, you implicitly commit to shepherding the
follow-up issues as well (Python/R APIs, docs).

Creating JIRA issues
* Try to break down big features into small and specific JIRA tasks and link them properly.
* Add a "starter" label to starter tasks.
* Put a rough time estimate for medium/big features and track the progress.
* Set Priority carefully.  Priority should not be conflated with the size of the implementation effort.

Managing JIRA issues and PRs
* Please add yourself to the Shepherd field on JIRA if you start reviewing a PR.
* If the code looks good to you, please comment "LGTM". For non-trivial PRs, please ping a
Committer experienced with the relevant code to make a final pass.

Follow-up issues: *After merging a PR, create and link the necessary follow-up JIRAs.*
* For a new Scala/Java API
** Create issues for adding analogous Python and R APIs
** Create issues for adding example code and documentation
* For a new Python/R API
** Create issues for adding example code and documentation

h1. Roadmap for this release

This roadmap only includes larger, more critical tasks targeted at the next release.  To find
all issues targeted for the next release, use the links listed below.

Notes
* We will prioritize API parity, bug fixes, and improvements over new features.
* The RDD-based API (`spark.mllib`) is in maintenance mode now.  We will accept bug fixes
for it, but new features, APIs, and improvements will only be added to the DataFrame-based
API (`spark.ml`).
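To illustrate the distinction, here is a minimal sketch (it assumes a Spark 2.x build on the classpath and an existing training DataFrame, so it is not runnable on its own):

```scala
// RDD-based API (maintenance mode): bug fixes only.
import org.apache.spark.mllib.clustering.KMeans

// DataFrame-based API: where new features, APIs, and improvements land.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(lr))
// val model = pipeline.fit(trainingDF)  // trainingDF: DataFrame of (label, features)
```

New contributions should target the {{spark.ml}} style shown in the second half.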

*WIP: This section is still being updated, pending confirmation of the Roadmap Process described
above.*

h2. Critical feature parity in DataFrame-based API
* Umbrella JIRA: [SPARK-4591]

h2. Persistence
* Complete persistence within MLlib
** Python tuning (SPARK-13786)
* MLlib in R format: compatibility with other languages (SPARK-15572)
* Impose backwards compatibility for persistence (SPARK-15573)

h2. SparkR
* Release SparkR on CRAN [SPARK-15799]

h2. Other prioritized issues: links for searching JIRA

This section provides links to help people identify smaller patches targeted at the next release,
as well as patches for major areas within MLlib.

* [All MLlib, SparkR, GraphX JIRAs with Target Version 2.2 | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.2.0)%20ORDER%20BY%20priority]

* [MLlib, SparkR, GraphX Umbrella JIRAs (regardless of Target Version) | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20Type%20%3D%20%22Umbrella%22%20AND%20Status%20in%20(%22Open%22%2C%20%22In%20Progress%22%2C%20%22Reopened%22)]
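For readability, the URL-encoded JQL behind the two searches above decodes to the following queries, which can be pasted into JIRA's advanced issue search:

```
-- All MLlib, SparkR, GraphX JIRAs with Target Version 2.2
project = SPARK AND component in (ML, MLlib, SparkR, GraphX)
  AND "Target Version/s" = 2.2.0
  AND (fixVersion is EMPTY OR fixVersion != 2.2.0)
ORDER BY priority

-- MLlib, SparkR, GraphX Umbrella JIRAs (regardless of Target Version)
project = SPARK AND component in (ML, MLlib, SparkR, GraphX)
  AND Type = "Umbrella"
  AND Status in ("Open", "In Progress", "Reopened")
```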

h1. Long-term roadmap

This section lists long-term or ongoing efforts.  For example, Python/R API parity with Scala/Java
will always be a priority, but we do not promise exact parity in each release.

h2. Python and R feature parity

Python feature parity: The main goal of the Python API is to have feature parity with the
Scala/Java API. You can find a [complete list of Python MLlib issues targeted at the next
release here| https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20"In%20Progress"%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20component%20in%20(PySpark)%20AND%20"Target%20Version%2Fs"%20%3D%202.2.0%20ORDER%20BY%20priority%20DESC].

R feature parity: We are building towards feature parity in SparkR as well. You can find a
[complete list of SparkR MLlib issues targeted at the next release here| https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20"In%20Progress"%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20component%20in%20(SparkR)%20AND%20"Target%20Version%2Fs"%20%3D%202.2.0%20ORDER%20BY%20priority%20DESC].



> MLlib 2.2 Roadmap
> -----------------
>
>                 Key: SPARK-18813
>                 URL: https://issues.apache.org/jira/browse/SPARK-18813
>             Project: Spark
>          Issue Type: Umbrella
>          Components: ML, MLlib
>            Reporter: Joseph K. Bradley
>            Priority: Blocker
>              Labels: roadmap
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
