spark-commits mailing list archives

From jkbrad...@apache.org
Subject [2/2] spark git commit: [SPARK-14817][ML][MLLIB][DOC] Made DataFrame-based API primary in MLlib guide
Date Fri, 15 Jul 2016 20:38:39 GMT
[SPARK-14817][ML][MLLIB][DOC] Made DataFrame-based API primary in MLlib guide

## What changes were proposed in this pull request?

Made DataFrame-based API primary
* Spark doc menu bar and other places now link to ml-guide.html, not mllib-guide.html
* mllib-guide.html keeps RDD-specific list of features, with a link at the top redirecting people to ml-guide.html
* ml-guide.html includes a "maintenance mode" announcement about the RDD-based API
  * **Reviewers: please check this carefully**
* (minor) Titles for the DataFrame-based API no longer include the "- spark.ml" suffix; titles for the RDD-based API now have a "- RDD-based API" suffix
* Moved migration guide to ml-guide from mllib-guide
  * Also moved past guides from mllib-migration-guides to ml-migration-guides, with a redirect link on mllib-migration-guides
  * **Reviewers**: I did not change any of the content of the migration guides.

Reorganized DataFrame-based guide:
* ml-guide.html mimics the old mllib-guide.html page in terms of content: overview, migration guide, etc.
* Moved Pipeline description into ml-pipeline.html and moved tuning into ml-tuning.html
  * **Reviewers**: I did not change the content of these guides, except some intro text.
* Sidebar remains the same, but with pipeline and tuning sections added

Other:
* ml-classification-regression.html: Moved text about linear methods to new section in page

## How was this patch tested?

Generated docs locally

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #14213 from jkbradley/ml-guide-2.0.

(cherry picked from commit 5ffd5d3838da40ad408a6f40071fe6f4dcacf2a1)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/90686abb
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/90686abb
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/90686abb

Branch: refs/heads/branch-2.0
Commit: 90686abbddcbfa85a53abb6b87a1f53f1e9adbed
Parents: c5f9355
Author: Joseph K. Bradley <joseph@databricks.com>
Authored: Fri Jul 15 13:38:23 2016 -0700
Committer: Joseph K. Bradley <joseph@databricks.com>
Committed: Fri Jul 15 13:38:33 2016 -0700

----------------------------------------------------------------------
 docs/_data/menu-ml.yaml                 |   6 +-
 docs/_includes/nav-left-wrapper-ml.html |   4 +-
 docs/_layouts/global.html               |   2 +-
 docs/index.md                           |   4 +-
 docs/ml-advanced.md                     |   4 +-
 docs/ml-ann.md                          |   4 +-
 docs/ml-classification-regression.md    |  60 ++--
 docs/ml-clustering.md                   |   8 +-
 docs/ml-collaborative-filtering.md      |   4 +-
 docs/ml-decision-tree.md                |   4 +-
 docs/ml-ensembles.md                    |   4 +-
 docs/ml-features.md                     |   4 +-
 docs/ml-guide.md                        | 461 ++++++++++-----------------
 docs/ml-linear-methods.md               |   4 +-
 docs/ml-migration-guides.md             | 159 +++++++++
 docs/ml-pipeline.md                     | 245 ++++++++++++++
 docs/ml-survival-regression.md          |   4 +-
 docs/ml-tuning.md                       | 121 +++++++
 docs/mllib-classification-regression.md |   4 +-
 docs/mllib-clustering.md                |   4 +-
 docs/mllib-collaborative-filtering.md   |   4 +-
 docs/mllib-data-types.md                |   4 +-
 docs/mllib-decision-tree.md             |   4 +-
 docs/mllib-dimensionality-reduction.md  |   4 +-
 docs/mllib-ensembles.md                 |   4 +-
 docs/mllib-evaluation-metrics.md        |   4 +-
 docs/mllib-feature-extraction.md        |   4 +-
 docs/mllib-frequent-pattern-mining.md   |   4 +-
 docs/mllib-guide.md                     | 219 +------------
 docs/mllib-isotonic-regression.md       |   4 +-
 docs/mllib-linear-methods.md            |   4 +-
 docs/mllib-migration-guides.md          | 158 +--------
 docs/mllib-naive-bayes.md               |   4 +-
 docs/mllib-optimization.md              |   4 +-
 docs/mllib-pmml-model-export.md         |   4 +-
 docs/mllib-statistics.md                |   4 +-
 docs/programming-guide.md               |   2 +-
 docs/streaming-programming-guide.md     |   4 +-
 python/pyspark/ml/__init__.py           |   4 +-
 python/pyspark/ml/tests.py              |   2 +-
 python/pyspark/mllib/__init__.py        |   5 +-
 41 files changed, 814 insertions(+), 746 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/90686abb/docs/_data/menu-ml.yaml
----------------------------------------------------------------------
diff --git a/docs/_data/menu-ml.yaml b/docs/_data/menu-ml.yaml
index 3fd3ee2..0c6b9b2 100644
--- a/docs/_data/menu-ml.yaml
+++ b/docs/_data/menu-ml.yaml
@@ -1,5 +1,5 @@
-- text: "Overview: estimators, transformers and pipelines"
-  url: ml-guide.html
+- text: Pipelines
+  url: ml-pipeline.html
 - text: Extracting, transforming and selecting features
   url: ml-features.html
 - text: Classification and Regression
@@ -8,5 +8,7 @@
   url: ml-clustering.html
 - text: Collaborative filtering
   url: ml-collaborative-filtering.html
+- text: Model selection and tuning
+  url: ml-tuning.html
 - text: Advanced topics
   url: ml-advanced.html

http://git-wip-us.apache.org/repos/asf/spark/blob/90686abb/docs/_includes/nav-left-wrapper-ml.html
----------------------------------------------------------------------
diff --git a/docs/_includes/nav-left-wrapper-ml.html b/docs/_includes/nav-left-wrapper-ml.html
index e2d7eda..00ac6cc 100644
--- a/docs/_includes/nav-left-wrapper-ml.html
+++ b/docs/_includes/nav-left-wrapper-ml.html
@@ -1,8 +1,8 @@
 <div class="left-menu-wrapper">
     <div class="left-menu">
-        <h3><a href="ml-guide.html">spark.ml package</a></h3>
+        <h3><a href="ml-guide.html">MLlib: Main Guide</a></h3>
         {% include nav-left.html nav=include.nav-ml %}
-        <h3><a href="mllib-guide.html">spark.mllib package</a></h3>
+        <h3><a href="mllib-guide.html">MLlib: RDD-based API Guide</a></h3>
         {% include nav-left.html nav=include.nav-mllib %}
     </div>
 </div>
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/spark/blob/90686abb/docs/_layouts/global.html
----------------------------------------------------------------------
diff --git a/docs/_layouts/global.html b/docs/_layouts/global.html
index 2d0c3fd..d3bf082 100755
--- a/docs/_layouts/global.html
+++ b/docs/_layouts/global.html
@@ -74,7 +74,7 @@
                                 <li><a href="streaming-programming-guide.html">Spark Streaming</a></li>
                                 <li><a href="sql-programming-guide.html">DataFrames, Datasets and SQL</a></li>
                                 <li><a href="structured-streaming-programming-guide.html">Structured Streaming</a></li>
-                                <li><a href="mllib-guide.html">MLlib (Machine Learning)</a></li>
+                                <li><a href="ml-guide.html">MLlib (Machine Learning)</a></li>
                                 <li><a href="graphx-programming-guide.html">GraphX (Graph Processing)</a></li>
                                 <li><a href="sparkr.html">SparkR (R on Spark)</a></li>
                             </ul>

http://git-wip-us.apache.org/repos/asf/spark/blob/90686abb/docs/index.md
----------------------------------------------------------------------
diff --git a/docs/index.md b/docs/index.md
index 7157afc..0cb8803 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -8,7 +8,7 @@ description: Apache Spark SPARK_VERSION_SHORT documentation homepage
 Apache Spark is a fast and general-purpose cluster computing system.
 It provides high-level APIs in Java, Scala, Python and R,
 and an optimized engine that supports general execution graphs.
-It also supports a rich set of higher-level tools including [Spark SQL](sql-programming-guide.html) for SQL and structured data processing, [MLlib](mllib-guide.html) for machine learning, [GraphX](graphx-programming-guide.html) for graph processing, and [Spark Streaming](streaming-programming-guide.html).
+It also supports a rich set of higher-level tools including [Spark SQL](sql-programming-guide.html) for SQL and structured data processing, [MLlib](ml-guide.html) for machine learning, [GraphX](graphx-programming-guide.html) for graph processing, and [Spark Streaming](streaming-programming-guide.html).
 
 # Downloading
 
@@ -87,7 +87,7 @@ options for deployment:
 * Modules built on Spark:
   * [Spark Streaming](streaming-programming-guide.html): processing real-time data streams
   * [Spark SQL, Datasets, and DataFrames](sql-programming-guide.html): support for structured data and relational queries
-  * [MLlib](mllib-guide.html): built-in machine learning library
+  * [MLlib](ml-guide.html): built-in machine learning library
   * [GraphX](graphx-programming-guide.html): Spark's new API for graph processing
 
 **API Docs:**

http://git-wip-us.apache.org/repos/asf/spark/blob/90686abb/docs/ml-advanced.md
----------------------------------------------------------------------
diff --git a/docs/ml-advanced.md b/docs/ml-advanced.md
index 1c5f844..f5804fd 100644
--- a/docs/ml-advanced.md
+++ b/docs/ml-advanced.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Advanced topics - spark.ml
-displayTitle: Advanced topics - spark.ml
+title: Advanced topics
+displayTitle: Advanced topics
 ---
 
 * Table of contents

http://git-wip-us.apache.org/repos/asf/spark/blob/90686abb/docs/ml-ann.md
----------------------------------------------------------------------
diff --git a/docs/ml-ann.md b/docs/ml-ann.md
index c2d9bd2..7c460c4 100644
--- a/docs/ml-ann.md
+++ b/docs/ml-ann.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Multilayer perceptron classifier - spark.ml
-displayTitle: Multilayer perceptron classifier - spark.ml
+title: Multilayer perceptron classifier
+displayTitle: Multilayer perceptron classifier
 ---
 
   > This section has been moved into the

http://git-wip-us.apache.org/repos/asf/spark/blob/90686abb/docs/ml-classification-regression.md
----------------------------------------------------------------------
diff --git a/docs/ml-classification-regression.md b/docs/ml-classification-regression.md
index 3d6106b..7c2437e 100644
--- a/docs/ml-classification-regression.md
+++ b/docs/ml-classification-regression.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Classification and regression - spark.ml
-displayTitle: Classification and regression - spark.ml
+title: Classification and regression
+displayTitle: Classification and regression
 ---
 
 
@@ -22,37 +22,14 @@ displayTitle: Classification and regression - spark.ml
 \newcommand{\zero}{\mathbf{0}}
 \]`
 
+This page covers algorithms for Classification and Regression.  It also includes sections
+discussing specific classes of algorithms, such as linear methods, trees, and ensembles.
+
 **Table of Contents**
 
 * This will become a table of contents (this text will be scraped).
 {:toc}
 
-In `spark.ml`, we implement popular linear methods such as logistic
-regression and linear least squares with $L_1$ or $L_2$ regularization.
-Refer to [the linear methods in mllib](mllib-linear-methods.html) for
-details about implementation and tuning.  We also include a DataFrame API for [Elastic
-net](http://en.wikipedia.org/wiki/Elastic_net_regularization), a hybrid
-of $L_1$ and $L_2$ regularization proposed in [Zou et al, Regularization
-and variable selection via the elastic
-net](http://users.stat.umn.edu/~zouxx019/Papers/elasticnet.pdf).
-Mathematically, it is defined as a convex combination of the $L_1$ and
-the $L_2$ regularization terms:
-`\[
-\alpha \left( \lambda \|\wv\|_1 \right) + (1-\alpha) \left( \frac{\lambda}{2}\|\wv\|_2^2 \right) , \alpha \in [0, 1], \lambda \geq 0
-\]`
-By setting $\alpha$ properly, elastic net contains both $L_1$ and $L_2$
-regularization as special cases. For example, if a [linear
-regression](https://en.wikipedia.org/wiki/Linear_regression) model is
-trained with the elastic net parameter $\alpha$ set to $1$, it is
-equivalent to a
-[Lasso](http://en.wikipedia.org/wiki/Least_squares#Lasso_method) model.
-On the other hand, if $\alpha$ is set to $0$, the trained model reduces
-to a [ridge
-regression](http://en.wikipedia.org/wiki/Tikhonov_regularization) model.
-We implement Pipelines API for both linear regression and logistic
-regression with elastic net regularization.
-
-
 # Classification
 
 ## Logistic regression
@@ -760,7 +737,34 @@ Refer to the [`IsotonicRegression` Python docs](api/python/pyspark.ml.html#pyspa
 </div>
 </div>
 
+# Linear methods
+
+We implement popular linear methods such as logistic
+regression and linear least squares with $L_1$ or $L_2$ regularization.
+Refer to [the linear methods guide for the RDD-based API](mllib-linear-methods.html) for
+details about implementation and tuning; this information is still relevant.
 
+We also include a DataFrame API for [Elastic
+net](http://en.wikipedia.org/wiki/Elastic_net_regularization), a hybrid
+of $L_1$ and $L_2$ regularization proposed in [Zou et al, Regularization
+and variable selection via the elastic
+net](http://users.stat.umn.edu/~zouxx019/Papers/elasticnet.pdf).
+Mathematically, it is defined as a convex combination of the $L_1$ and
+the $L_2$ regularization terms:
+`\[
+\alpha \left( \lambda \|\wv\|_1 \right) + (1-\alpha) \left( \frac{\lambda}{2}\|\wv\|_2^2 \right) , \alpha \in [0, 1], \lambda \geq 0
+\]`
+By setting $\alpha$ properly, elastic net contains both $L_1$ and $L_2$
+regularization as special cases. For example, if a [linear
+regression](https://en.wikipedia.org/wiki/Linear_regression) model is
+trained with the elastic net parameter $\alpha$ set to $1$, it is
+equivalent to a
+[Lasso](http://en.wikipedia.org/wiki/Least_squares#Lasso_method) model.
+On the other hand, if $\alpha$ is set to $0$, the trained model reduces
+to a [ridge
+regression](http://en.wikipedia.org/wiki/Tikhonov_regularization) model.
+We implement Pipelines API for both linear regression and logistic
+regression with elastic net regularization.
 
 # Decision trees
 

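The elastic net term moved into the new "Linear methods" section above reduces to Lasso at $\alpha = 1$ and to ridge regression at $\alpha = 0$. A minimal pure-Python sketch of that penalty (function and variable names are illustrative only, not part of Spark's API) makes the special cases concrete:

```python
# Illustrative sketch of the elastic net penalty from the guide text above:
#   alpha * (lam * ||w||_1) + (1 - alpha) * (lam/2 * ||w||_2^2)
# Not Spark code; in spark.ml the same knobs are the regParam (lambda)
# and elasticNetParam (alpha) parameters of the linear models.

def elastic_net_penalty(w, alpha, lam):
    """Convex combination of L1 and L2 regularization terms."""
    l1 = sum(abs(wi) for wi in w)          # ||w||_1
    l2_sq = sum(wi * wi for wi in w)       # ||w||_2^2
    return alpha * (lam * l1) + (1.0 - alpha) * (lam / 2.0) * l2_sq

w = [0.5, -2.0, 1.5]
lam = 0.1

# alpha = 1: pure L1 penalty (Lasso): 0.1 * (0.5 + 2.0 + 1.5) = 0.4
lasso = elastic_net_penalty(w, alpha=1.0, lam=lam)
# alpha = 0: pure L2 penalty (ridge): 0.05 * (0.25 + 4.0 + 2.25) = 0.325
ridge = elastic_net_penalty(w, alpha=0.0, lam=lam)
print(lasso, ridge)
```

Intermediate values of $\alpha$ simply interpolate between the two penalties, which is why a single parameterization covers Lasso, ridge, and everything in between.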
http://git-wip-us.apache.org/repos/asf/spark/blob/90686abb/docs/ml-clustering.md
----------------------------------------------------------------------
diff --git a/docs/ml-clustering.md b/docs/ml-clustering.md
index 8656eb4..8a0a61c 100644
--- a/docs/ml-clustering.md
+++ b/docs/ml-clustering.md
@@ -1,10 +1,12 @@
 ---
 layout: global
-title: Clustering - spark.ml
-displayTitle: Clustering - spark.ml
+title: Clustering
+displayTitle: Clustering
 ---
 
-In this section, we introduce the pipeline API for [clustering in mllib](mllib-clustering.html).
+This page describes clustering algorithms in MLlib.
+The [guide for clustering in the RDD-based API](mllib-clustering.html) also has relevant information
+about these algorithms.
 
 **Table of Contents**
 

http://git-wip-us.apache.org/repos/asf/spark/blob/90686abb/docs/ml-collaborative-filtering.md
----------------------------------------------------------------------
diff --git a/docs/ml-collaborative-filtering.md b/docs/ml-collaborative-filtering.md
index 8bd75f3..1d02d69 100644
--- a/docs/ml-collaborative-filtering.md
+++ b/docs/ml-collaborative-filtering.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Collaborative Filtering - spark.ml
-displayTitle: Collaborative Filtering - spark.ml
+title: Collaborative Filtering
+displayTitle: Collaborative Filtering
 ---
 
 * Table of contents

http://git-wip-us.apache.org/repos/asf/spark/blob/90686abb/docs/ml-decision-tree.md
----------------------------------------------------------------------
diff --git a/docs/ml-decision-tree.md b/docs/ml-decision-tree.md
index a721d55..5e1eeb9 100644
--- a/docs/ml-decision-tree.md
+++ b/docs/ml-decision-tree.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Decision trees - spark.ml
-displayTitle: Decision trees - spark.ml
+title: Decision trees
+displayTitle: Decision trees
 ---
 
   > This section has been moved into the

http://git-wip-us.apache.org/repos/asf/spark/blob/90686abb/docs/ml-ensembles.md
----------------------------------------------------------------------
diff --git a/docs/ml-ensembles.md b/docs/ml-ensembles.md
index 303773e..97f1bdc 100644
--- a/docs/ml-ensembles.md
+++ b/docs/ml-ensembles.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Tree ensemble methods - spark.ml
-displayTitle: Tree ensemble methods - spark.ml
+title: Tree ensemble methods
+displayTitle: Tree ensemble methods
 ---
 
   > This section has been moved into the

http://git-wip-us.apache.org/repos/asf/spark/blob/90686abb/docs/ml-features.md
----------------------------------------------------------------------
diff --git a/docs/ml-features.md b/docs/ml-features.md
index 88fd291..e7d7ddf 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Extracting, transforming and selecting features - spark.ml
-displayTitle: Extracting, transforming and selecting features - spark.ml
+title: Extracting, transforming and selecting features
+displayTitle: Extracting, transforming and selecting features
 ---
 
 This section covers algorithms for working with features, roughly divided into these groups:

http://git-wip-us.apache.org/repos/asf/spark/blob/90686abb/docs/ml-guide.md
----------------------------------------------------------------------
diff --git a/docs/ml-guide.md b/docs/ml-guide.md
index dae86d8..5abec63 100644
--- a/docs/ml-guide.md
+++ b/docs/ml-guide.md
@@ -1,323 +1,214 @@
 ---
 layout: global
-title: "Overview: estimators, transformers and pipelines - spark.ml"
-displayTitle: "Overview: estimators, transformers and pipelines - spark.ml"
+title: "MLlib: Main Guide"
+displayTitle: "Machine Learning Library (MLlib) Guide"
 ---
 
+MLlib is Spark's machine learning (ML) library.
+Its goal is to make practical machine learning scalable and easy.
+At a high level, it provides tools such as:
 
-`\[
-\newcommand{\R}{\mathbb{R}}
-\newcommand{\E}{\mathbb{E}}
-\newcommand{\x}{\mathbf{x}}
-\newcommand{\y}{\mathbf{y}}
-\newcommand{\wv}{\mathbf{w}}
-\newcommand{\av}{\mathbf{\alpha}}
-\newcommand{\bv}{\mathbf{b}}
-\newcommand{\N}{\mathbb{N}}
-\newcommand{\id}{\mathbf{I}}
-\newcommand{\ind}{\mathbf{1}}
-\newcommand{\0}{\mathbf{0}}
-\newcommand{\unit}{\mathbf{e}}
-\newcommand{\one}{\mathbf{1}}
-\newcommand{\zero}{\mathbf{0}}
-\]`
+* ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
+* Featurization: feature extraction, transformation, dimensionality reduction, and selection
+* Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
+* Persistence: saving and loading algorithms, models, and Pipelines
+* Utilities: linear algebra, statistics, data handling, etc.
 
+# Announcement: DataFrame-based API is primary API
 
-The `spark.ml` package aims to provide a uniform set of high-level APIs built on top of
-[DataFrames](sql-programming-guide.html#dataframes) that help users create and tune practical
-machine learning pipelines.
-See the [algorithm guides](#algorithm-guides) section below for guides on sub-packages of
-`spark.ml`, including feature transformers unique to the Pipelines API, ensembles, and more.
+**The MLlib RDD-based API is now in maintenance mode.**
 
-**Table of contents**
+As of Spark 2.0, the [RDD](programming-guide.html#resilient-distributed-datasets-rdds)-based APIs in the `spark.mllib` package have entered maintenance mode.
+The primary Machine Learning API for Spark is now the [DataFrame](sql-programming-guide.html)-based API in the `spark.ml` package.
 
-* This will become a table of contents (this text will be scraped).
-{:toc}
+*What are the implications?*
 
+* MLlib will still support the RDD-based API in `spark.mllib` with bug fixes.
+* MLlib will not add new features to the RDD-based API.
+* In the Spark 2.x releases, MLlib will add features to the DataFrame-based API to reach feature parity with the RDD-based API.
+* After reaching feature parity (roughly estimated for Spark 2.2), the RDD-based API will be deprecated.
+* The RDD-based API is expected to be removed in Spark 3.0.
 
-# Main concepts in Pipelines
+*Why is MLlib switching to the DataFrame-based API?*
 
-Spark ML standardizes APIs for machine learning algorithms to make it easier to combine multiple
-algorithms into a single pipeline, or workflow.
-This section covers the key concepts introduced by the Spark ML API, where the pipeline concept is
-mostly inspired by the [scikit-learn](http://scikit-learn.org/) project.
+* DataFrames provide a more user-friendly API than RDDs.  The many benefits of DataFrames include Spark Datasources, SQL/DataFrame queries, Tungsten and Catalyst optimizations, and uniform APIs across languages.
+* The DataFrame-based API for MLlib provides a uniform API across ML algorithms and across multiple languages.
+* DataFrames facilitate practical ML Pipelines, particularly feature transformations.  See the [Pipelines guide](ml-pipeline.html) for details.
 
-* **[`DataFrame`](ml-guide.html#dataframe)**: Spark ML uses `DataFrame` from Spark SQL as an ML
-  dataset, which can hold a variety of data types.
-  E.g., a `DataFrame` could have different columns storing text, feature vectors, true labels, and predictions.
+# Dependencies
 
-* **[`Transformer`](ml-guide.html#transformers)**: A `Transformer` is an algorithm which can transform one `DataFrame` into another `DataFrame`.
-E.g., an ML model is a `Transformer` which transforms a `DataFrame` with features into a `DataFrame` with predictions.
+MLlib uses the linear algebra package [Breeze](http://www.scalanlp.org/), which depends on
+[netlib-java](https://github.com/fommil/netlib-java) for optimised numerical processing.
+If native libraries[^1] are not available at runtime, you will see a warning message and a pure JVM
+implementation will be used instead.
 
-* **[`Estimator`](ml-guide.html#estimators)**: An `Estimator` is an algorithm which can be fit on a `DataFrame` to produce a `Transformer`.
-E.g., a learning algorithm is an `Estimator` which trains on a `DataFrame` and produces a model.
+Due to licensing issues with runtime proprietary binaries, we do not include `netlib-java`'s native
+proxies by default.
+To configure `netlib-java` / Breeze to use system optimised binaries, include
+`com.github.fommil.netlib:all:1.1.2` (or build Spark with `-Pnetlib-lgpl`) as a dependency of your
+project and read the [netlib-java](https://github.com/fommil/netlib-java) documentation for your
+platform's additional installation instructions.
 
-* **[`Pipeline`](ml-guide.html#pipeline)**: A `Pipeline` chains multiple `Transformer`s and `Estimator`s together to specify an ML workflow.
+To use MLlib in Python, you will need [NumPy](http://www.numpy.org) version 1.4 or newer.
 
-* **[`Parameter`](ml-guide.html#parameters)**: All `Transformer`s and `Estimator`s now share a common API for specifying parameters.
+[^1]: To learn more about the benefits and background of system optimised natives, you may wish to
+    watch Sam Halliday's ScalaX talk on [High Performance Linear Algebra in Scala](http://fommil.github.io/scalax14/#/).
 
-## DataFrame
+# Migration guide
 
-Machine learning can be applied to a wide variety of data types, such as vectors, text, images, and structured data.
-Spark ML adopts the `DataFrame` from Spark SQL in order to support a variety of data types.
+MLlib is under active development.
+The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
+and the migration guide below will explain all changes between releases.
 
-`DataFrame` supports many basic and structured types; see the [Spark SQL datatype reference](sql-programming-guide.html#spark-sql-datatype-reference) for a list of supported types.
-In addition to the types listed in the Spark SQL guide, `DataFrame` can use ML [`Vector`](mllib-data-types.html#local-vector) types.
+## From 1.6 to 2.0
 
-A `DataFrame` can be created either implicitly or explicitly from a regular `RDD`.  See the code examples below and the [Spark SQL programming guide](sql-programming-guide.html) for examples.
+### Breaking changes
 
-Columns in a `DataFrame` are named.  The code examples below use names such as "text," "features," and "label."
+There were several breaking changes in Spark 2.0, which are outlined below.
 
-## Pipeline components
+**Linear algebra classes for DataFrame-based APIs**
 
-### Transformers
+Spark's linear algebra dependencies were moved to a new project, `mllib-local` 
+(see [SPARK-13944](https://issues.apache.org/jira/browse/SPARK-13944)). 
+As part of this change, the linear algebra classes were copied to a new package, `spark.ml.linalg`. 
+The DataFrame-based APIs in `spark.ml` now depend on the `spark.ml.linalg` classes, 
+leading to a few breaking changes, predominantly in various model classes 
+(see [SPARK-14810](https://issues.apache.org/jira/browse/SPARK-14810) for a full list).
 
-A `Transformer` is an abstraction that includes feature transformers and learned models.
-Technically, a `Transformer` implements a method `transform()`, which converts one `DataFrame` into
-another, generally by appending one or more columns.
-For example:
+**Note:** the RDD-based APIs in `spark.mllib` continue to depend on the previous package `spark.mllib.linalg`.
 
-* A feature transformer might take a `DataFrame`, read a column (e.g., text), map it into a new
-  column (e.g., feature vectors), and output a new `DataFrame` with the mapped column appended.
-* A learning model might take a `DataFrame`, read the column containing feature vectors, predict the
-  label for each feature vector, and output a new `DataFrame` with predicted labels appended as a
-  column.
+_Converting vectors and matrices_
 
-### Estimators
+While most pipeline components support backward compatibility for loading, 
+some existing `DataFrames` and pipelines in Spark versions prior to 2.0, that contain vector or matrix 
+columns, may need to be migrated to the new `spark.ml` vector and matrix types. 
+Utilities for converting `DataFrame` columns from `spark.mllib.linalg` to `spark.ml.linalg` types
+(and vice versa) can be found in `spark.mllib.util.MLUtils`.
 
-An `Estimator` abstracts the concept of a learning algorithm or any algorithm that fits or trains on
-data.
-Technically, an `Estimator` implements a method `fit()`, which accepts a `DataFrame` and produces a
-`Model`, which is a `Transformer`.
-For example, a learning algorithm such as `LogisticRegression` is an `Estimator`, and calling
-`fit()` trains a `LogisticRegressionModel`, which is a `Model` and hence a `Transformer`.
-
-### Properties of pipeline components
-
-`Transformer.transform()`s and `Estimator.fit()`s are both stateless.  In the future, stateful algorithms may be supported via alternative concepts.
-
-Each instance of a `Transformer` or `Estimator` has a unique ID, which is useful in specifying parameters (discussed below).
-
-## Pipeline
-
-In machine learning, it is common to run a sequence of algorithms to process and learn from data.
-E.g., a simple text document processing workflow might include several stages:
-
-* Split each document's text into words.
-* Convert each document's words into a numerical feature vector.
-* Learn a prediction model using the feature vectors and labels.
-
-Spark ML represents such a workflow as a `Pipeline`, which consists of a sequence of
-`PipelineStage`s (`Transformer`s and `Estimator`s) to be run in a specific order.
-We will use this simple workflow as a running example in this section.
-
-### How it works
-
-A `Pipeline` is specified as a sequence of stages, and each stage is either a `Transformer` or an `Estimator`.
-These stages are run in order, and the input `DataFrame` is transformed as it passes through each stage.
-For `Transformer` stages, the `transform()` method is called on the `DataFrame`.
-For `Estimator` stages, the `fit()` method is called to produce a `Transformer` (which becomes part of the `PipelineModel`, or fitted `Pipeline`), and that `Transformer`'s `transform()` method is called on the `DataFrame`.
-
-We illustrate this for the simple text document workflow.  The figure below is for the *training time* usage of a `Pipeline`.
-
-<p style="text-align: center;">
-  <img
-    src="img/ml-Pipeline.png"
-    title="Spark ML Pipeline Example"
-    alt="Spark ML Pipeline Example"
-    width="80%"
-  />
-</p>
-
-Above, the top row represents a `Pipeline` with three stages.
-The first two (`Tokenizer` and `HashingTF`) are `Transformer`s (blue), and the third (`LogisticRegression`) is an `Estimator` (red).
-The bottom row represents data flowing through the pipeline, where cylinders indicate `DataFrame`s.
-The `Pipeline.fit()` method is called on the original `DataFrame`, which has raw text documents and labels.
-The `Tokenizer.transform()` method splits the raw text documents into words, adding a new column with words to the `DataFrame`.
-The `HashingTF.transform()` method converts the words column into feature vectors, adding a new column with those vectors to the `DataFrame`.
-Now, since `LogisticRegression` is an `Estimator`, the `Pipeline` first calls `LogisticRegression.fit()` to produce a `LogisticRegressionModel`.
-If the `Pipeline` had more stages, it would call the `LogisticRegressionModel`'s `transform()`
-method on the `DataFrame` before passing the `DataFrame` to the next stage.
-
-A `Pipeline` is an `Estimator`.
-Thus, after a `Pipeline`'s `fit()` method runs, it produces a `PipelineModel`, which is a
-`Transformer`.
-This `PipelineModel` is used at *test time*; the figure below illustrates this usage.
-
-<p style="text-align: center;">
-  <img
-    src="img/ml-PipelineModel.png"
-    title="Spark ML PipelineModel Example"
-    alt="Spark ML PipelineModel Example"
-    width="80%"
-  />
-</p>
-
-In the figure above, the `PipelineModel` has the same number of stages as the original `Pipeline`, but all `Estimator`s in the original `Pipeline` have become `Transformer`s.
-When the `PipelineModel`'s `transform()` method is called on a test dataset, the data are passed
-through the fitted pipeline in order.
-Each stage's `transform()` method updates the dataset and passes it to the next stage.
-
-`Pipeline`s and `PipelineModel`s help to ensure that training and test data go through identical feature processing steps.
-
-### Details
-
-*DAG `Pipeline`s*: A `Pipeline`'s stages are specified as an ordered array.  The examples given here are all for linear `Pipeline`s, i.e., `Pipeline`s in which each stage uses data produced by the previous stage.  It is possible to create non-linear `Pipeline`s as long as the data flow graph forms a Directed Acyclic Graph (DAG).  This graph is currently specified implicitly based on the input and output column names of each stage (generally specified as parameters).  If the `Pipeline` forms a DAG, then the stages must be specified in topological order.
-
-*Runtime checking*: Since `Pipeline`s can operate on `DataFrame`s with varied types, they cannot use
-compile-time type checking.
-`Pipeline`s and `PipelineModel`s instead do runtime checking before actually running the `Pipeline`.
-This type checking is done using the `DataFrame` *schema*, a description of the data types of columns in the `DataFrame`.
-
-*Unique Pipeline stages*: A `Pipeline`'s stages should be unique instances.  E.g., the same instance
-`myHashingTF` should not be inserted into the `Pipeline` twice since `Pipeline` stages must have
-unique IDs.  However, different instances `myHashingTF1` and `myHashingTF2` (both of type `HashingTF`)
-can be put into the same `Pipeline` since different instances will be created with different IDs.
-
-## Parameters
-
-Spark ML `Estimator`s and `Transformer`s use a uniform API for specifying parameters.
-
-A `Param` is a named parameter with self-contained documentation.
-A `ParamMap` is a set of (parameter, value) pairs.
-
-There are two main ways to pass parameters to an algorithm:
-
-1. Set parameters for an instance.  E.g., if `lr` is an instance of `LogisticRegression`, one could
-   call `lr.setMaxIter(10)` to make `lr.fit()` use at most 10 iterations.
-   This API resembles the API used in `spark.mllib` package.
-2. Pass a `ParamMap` to `fit()` or `transform()`.  Any parameters in the `ParamMap` will override parameters previously specified via setter methods.
-
-Parameters belong to specific instances of `Estimator`s and `Transformer`s.
-For example, if we have two `LogisticRegression` instances `lr1` and `lr2`, then we can build a `ParamMap` with both `maxIter` parameters specified: `ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20)`.
-This is useful if there are two algorithms with the `maxIter` parameter in a `Pipeline`.
-
-## Saving and Loading Pipelines
-
-Often times it is worth it to save a model or a pipeline to disk for later use. In Spark 1.6, a model import/export functionality was added to the Pipeline API. Most basic transformers are supported as well as some of the more basic ML models. Please refer to the algorithm's API documentation to see if saving and loading is supported.
-
-# Code examples
-
-This section gives code examples illustrating the functionality discussed above.
-For more info, please refer to the API documentation
-([Scala](api/scala/index.html#org.apache.spark.ml.package),
-[Java](api/java/org/apache/spark/ml/package-summary.html),
-and [Python](api/python/pyspark.ml.html)).
-Some Spark ML algorithms are wrappers for `spark.mllib` algorithms, and the
-[MLlib programming guide](mllib-guide.html) has details on specific algorithms.
-
-## Example: Estimator, Transformer, and Param
-
-This example covers the concepts of `Estimator`, `Transformer`, and `Param`.
+There are also utility methods for converting single instances of
+vectors and matrices. Use the `asML` method on an `mllib.linalg.Vector` / `mllib.linalg.Matrix`
+to convert to `ml.linalg` types, and
+`mllib.linalg.Vectors.fromML` / `mllib.linalg.Matrices.fromML`
+to convert to `mllib.linalg` types.
 
 <div class="codetabs">
+<div data-lang="scala"  markdown="1">
 
-<div data-lang="scala">
-{% include_example scala/org/apache/spark/examples/ml/EstimatorTransformerParamExample.scala %}
-</div>
+{% highlight scala %}
+import org.apache.spark.mllib.util.MLUtils
 
-<div data-lang="java">
-{% include_example java/org/apache/spark/examples/ml/JavaEstimatorTransformerParamExample.java %}
-</div>
+// convert DataFrame columns
+val convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF)
+val convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF)
+// convert a single vector or matrix
+val mlVec: org.apache.spark.ml.linalg.Vector = mllibVec.asML
+val mlMat: org.apache.spark.ml.linalg.Matrix = mllibMat.asML
+{% endhighlight %}
 
-<div data-lang="python">
-{% include_example python/ml/estimator_transformer_param_example.py %}
-</div>
-
-</div>
-
-## Example: Pipeline
-
-This example follows the simple text document `Pipeline` illustrated in the figures above.
-
-<div class="codetabs">
-
-<div data-lang="scala">
-{% include_example scala/org/apache/spark/examples/ml/PipelineExample.scala %}
-</div>
-
-<div data-lang="java">
-{% include_example java/org/apache/spark/examples/ml/JavaPipelineExample.java %}
-</div>
-
-<div data-lang="python">
-{% include_example python/ml/pipeline_example.py %}
-</div>
-
-</div>
-
-## Example: model selection via cross-validation
-
-An important task in ML is *model selection*, or using data to find the best model or parameters for a given task.  This is also called *tuning*.
-`Pipeline`s facilitate model selection by making it easy to tune an entire `Pipeline` at once, rather than tuning each element in the `Pipeline` separately.
-
-Currently, `spark.ml` supports model selection using the [`CrossValidator`](api/scala/index.html#org.apache.spark.ml.tuning.CrossValidator) class, which takes an `Estimator`, a set of `ParamMap`s, and an [`Evaluator`](api/scala/index.html#org.apache.spark.ml.evaluation.Evaluator).
-`CrossValidator` begins by splitting the dataset into a set of *folds* which are used as separate training and test datasets; e.g., with `$k=3$` folds, `CrossValidator` will generate 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing.
-`CrossValidator` iterates through the set of `ParamMap`s. For each `ParamMap`, it trains the given `Estimator` and evaluates it using the given `Evaluator`.
-
-The `Evaluator` can be a [`RegressionEvaluator`](api/scala/index.html#org.apache.spark.ml.evaluation.RegressionEvaluator)
-for regression problems, a [`BinaryClassificationEvaluator`](api/scala/index.html#org.apache.spark.ml.evaluation.BinaryClassificationEvaluator)
-for binary data, or a [`MulticlassClassificationEvaluator`](api/scala/index.html#org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator)
-for multiclass problems. The default metric used to choose the best `ParamMap` can be overridden by the `setMetricName`
-method in each of these evaluators.
-
-The `ParamMap` which produces the best evaluation metric (averaged over the `$k$` folds) is selected as the best model.
-`CrossValidator` finally fits the `Estimator` using the best `ParamMap` and the entire dataset.
-
-The following example demonstrates using `CrossValidator` to select from a grid of parameters.
-To help construct the parameter grid, we use the [`ParamGridBuilder`](api/scala/index.html#org.apache.spark.ml.tuning.ParamGridBuilder) utility.
-
-Note that cross-validation over a grid of parameters is expensive.
-E.g., in the example below, the parameter grid has 3 values for `hashingTF.numFeatures` and 2 values for `lr.regParam`, and `CrossValidator` uses 2 folds.  This multiplies out to `$(3 \times 2) \times 2 = 12$` different models being trained.
-In realistic settings, it can be common to try many more parameters and use more folds (`$k=3$` and `$k=10$` are common).
-In other words, using `CrossValidator` can be very expensive.
-However, it is also a well-established method for choosing parameters which is more statistically sound than heuristic hand-tuning.
-
-<div class="codetabs">
-
-<div data-lang="scala">
-{% include_example scala/org/apache/spark/examples/ml/ModelSelectionViaCrossValidationExample.scala %}
-</div>
-
-<div data-lang="java">
-{% include_example java/org/apache/spark/examples/ml/JavaModelSelectionViaCrossValidationExample.java %}
-</div>
-
-<div data-lang="python">
-
-{% include_example python/ml/cross_validator.py %}
-</div>
-
-</div>
-
-## Example: model selection via train validation split
-In addition to  `CrossValidator` Spark also offers `TrainValidationSplit` for hyper-parameter tuning.
-`TrainValidationSplit` only evaluates each combination of parameters once, as opposed to k times in
- the case of `CrossValidator`. It is therefore less expensive,
- but will not produce as reliable results when the training dataset is not sufficiently large.
-
-`TrainValidationSplit` takes an `Estimator`, a set of `ParamMap`s provided in the `estimatorParamMaps` parameter,
-and an `Evaluator`.
-It begins by splitting the dataset into two parts using the `trainRatio` parameter
-which are used as separate training and test datasets. For example with `$trainRatio=0.75$` (default),
-`TrainValidationSplit` will generate a training and test dataset pair where 75% of the data is used for training and 25% for validation.
-Similar to `CrossValidator`, `TrainValidationSplit` also iterates through the set of `ParamMap`s.
-For each combination of parameters, it trains the given `Estimator` and evaluates it using the given `Evaluator`.
-The `ParamMap` which produces the best evaluation metric is selected as the best option.
-`TrainValidationSplit` finally fits the `Estimator` using the best `ParamMap` and the entire dataset.
-
-<div class="codetabs">
-
-<div data-lang="scala" markdown="1">
-{% include_example scala/org/apache/spark/examples/ml/ModelSelectionViaTrainValidationSplitExample.scala %}
+Refer to the [`MLUtils` Scala docs](api/scala/index.html#org.apache.spark.mllib.util.MLUtils$) for further detail.
 </div>
 
 <div data-lang="java" markdown="1">
-{% include_example java/org/apache/spark/examples/ml/JavaModelSelectionViaTrainValidationSplitExample.java %}
-</div>
 
-<div data-lang="python">
-{% include_example python/ml/train_validation_split.py %}
-</div>
+{% highlight java %}
+import org.apache.spark.mllib.util.MLUtils;
+import org.apache.spark.sql.Dataset;
+
+// convert DataFrame columns
+Dataset<Row> convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF);
+Dataset<Row> convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF);
+// convert a single vector or matrix
+org.apache.spark.ml.linalg.Vector mlVec = mllibVec.asML();
+org.apache.spark.ml.linalg.Matrix mlMat = mllibMat.asML();
+{% endhighlight %}
+
+Refer to the [`MLUtils` Java docs](api/java/org/apache/spark/mllib/util/MLUtils.html) for further detail.
+</div>
+
+<div data-lang="python"  markdown="1">
+
+{% highlight python %}
+from pyspark.mllib.util import MLUtils
+
+# convert DataFrame columns
+convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF)
+convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF)
+# convert a single vector or matrix
+mlVec = mllibVec.asML()
+mlMat = mllibMat.asML()
+{% endhighlight %}
+
+Refer to the [`MLUtils` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.util.MLUtils) for further detail.
+</div>
+</div>
+
+**Deprecated methods removed**
+
+Several deprecated methods were removed in the `spark.mllib` and `spark.ml` packages:
+
+* `setScoreCol` in `ml.evaluation.BinaryClassificationEvaluator`
+* `weights` in `LinearRegression` and `LogisticRegression` in `spark.ml`
+* `setMaxNumIterations` in `mllib.optimization.LBFGS` (marked as `DeveloperApi`)
+* `treeReduce` and `treeAggregate` in `mllib.rdd.RDDFunctions` (these functions are available on `RDD`s directly, and were marked as `DeveloperApi`)
+* `defaultStategy` in `mllib.tree.configuration.Strategy`
+* `build` in `mllib.tree.Node`
+* The multiclass libsvm loaders and the load/save labeledData methods in `mllib.util.MLUtils`
+
+A full list of breaking changes can be found at [SPARK-14810](https://issues.apache.org/jira/browse/SPARK-14810).
+
+### Deprecations and changes of behavior
+
+**Deprecations**
+
+Deprecations in the `spark.mllib` and `spark.ml` packages include:
+
+* [SPARK-14984](https://issues.apache.org/jira/browse/SPARK-14984):
+ In `spark.ml.regression.LinearRegressionSummary`, the `model` field has been deprecated.
+* [SPARK-13784](https://issues.apache.org/jira/browse/SPARK-13784):
+ In `spark.ml.regression.RandomForestRegressionModel` and `spark.ml.classification.RandomForestClassificationModel`,
+ the `numTrees` parameter has been deprecated in favor of the `getNumTrees` method.
+* [SPARK-13761](https://issues.apache.org/jira/browse/SPARK-13761):
+ In `spark.ml.param.Params`, the `validateParams` method has been deprecated.
+ All functionality in overridden methods has been moved to the corresponding `transformSchema`.
+* [SPARK-14829](https://issues.apache.org/jira/browse/SPARK-14829):
+ In `spark.mllib` package, `LinearRegressionWithSGD`, `LassoWithSGD`, `RidgeRegressionWithSGD` and `LogisticRegressionWithSGD` have been deprecated.
+ We encourage users to use `spark.ml.regression.LinearRegression` and `spark.ml.classification.LogisticRegression`.
+* [SPARK-14900](https://issues.apache.org/jira/browse/SPARK-14900):
+ In `spark.mllib.evaluation.MulticlassMetrics`, the parameters `precision`, `recall` and `fMeasure` have been deprecated in favor of `accuracy`.
+* [SPARK-15644](https://issues.apache.org/jira/browse/SPARK-15644):
+ In `spark.ml.util.MLReader` and `spark.ml.util.MLWriter`, the `context` method has been deprecated in favor of `session`.
+* In `spark.ml.feature.ChiSqSelectorModel`, the `setLabelCol` method has been deprecated since it was not used by `ChiSqSelectorModel`.
+
+**Changes of behavior**
+
+Changes of behavior in the `spark.mllib` and `spark.ml` packages include:
+
+* [SPARK-7780](https://issues.apache.org/jira/browse/SPARK-7780):
+ `spark.mllib.classification.LogisticRegressionWithLBFGS` directly calls `spark.ml.classification.LogisticRegression` for binary classification now.
+ This will introduce the following behavior changes for `spark.mllib.classification.LogisticRegressionWithLBFGS`:
+    * The intercept will not be regularized when training a binary classification model with an L1/L2 `Updater`.
+    * When trained without regularization, training with or without feature scaling will return the same solution at the same convergence rate.
+* [SPARK-13429](https://issues.apache.org/jira/browse/SPARK-13429):
+ In order to provide a better and more consistent result with `spark.ml.classification.LogisticRegression`,
+ the default value of `convergenceTol` in `spark.mllib.classification.LogisticRegressionWithLBFGS` has been changed from 1E-4 to 1E-6.
+* [SPARK-12363](https://issues.apache.org/jira/browse/SPARK-12363):
+ Fixed a bug in `PowerIterationClustering`, which will likely change its result.
+* [SPARK-13048](https://issues.apache.org/jira/browse/SPARK-13048):
+ `LDA` using the `EM` optimizer will keep the last checkpoint by default, if checkpointing is being used.
+* [SPARK-12153](https://issues.apache.org/jira/browse/SPARK-12153):
+ `Word2Vec` now respects sentence boundaries. Previously, it did not handle them correctly.
+* [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574):
+ `HashingTF` uses `MurmurHash3` as the default hash algorithm in both `spark.ml` and `spark.mllib`.
+* [SPARK-14768](https://issues.apache.org/jira/browse/SPARK-14768):
+ The `expectedType` argument for PySpark `Param` was removed.
+* [SPARK-14931](https://issues.apache.org/jira/browse/SPARK-14931):
+ Some default `Param` values, which were mismatched between pipelines in Scala and Python, have been changed.
+* [SPARK-13600](https://issues.apache.org/jira/browse/SPARK-13600):
+ `QuantileDiscretizer` now uses `spark.sql.DataFrameStatFunctions.approxQuantile` to find splits (it previously used custom sampling logic).
+ The output buckets will differ for the same input data and parameters.
+
+## Previous Spark versions
+
+Earlier migration guides are archived [on this page](ml-migration-guides.html).
 
-</div>
+---

http://git-wip-us.apache.org/repos/asf/spark/blob/90686abb/docs/ml-linear-methods.md
----------------------------------------------------------------------
diff --git a/docs/ml-linear-methods.md b/docs/ml-linear-methods.md
index a875483..eb39173 100644
--- a/docs/ml-linear-methods.md
+++ b/docs/ml-linear-methods.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Linear methods - spark.ml
-displayTitle: Linear methods - spark.ml
+title: Linear methods
+displayTitle: Linear methods
 ---
 
   > This section has been moved into the

http://git-wip-us.apache.org/repos/asf/spark/blob/90686abb/docs/ml-migration-guides.md
----------------------------------------------------------------------
diff --git a/docs/ml-migration-guides.md b/docs/ml-migration-guides.md
new file mode 100644
index 0000000..82bf9d7
--- /dev/null
+++ b/docs/ml-migration-guides.md
@@ -0,0 +1,159 @@
+---
+layout: global
+title: Old Migration Guides - MLlib
+displayTitle: Old Migration Guides - MLlib
+description: MLlib migration guides from before Spark SPARK_VERSION_SHORT
+---
+
+The migration guide for the current Spark version is kept on the [MLlib Guide main page](ml-guide.html#migration-guide).
+
+## From 1.5 to 1.6
+
+There are no breaking API changes in the `spark.mllib` or `spark.ml` packages, but there are
+deprecations and changes of behavior.
+
+Deprecations:
+
+* [SPARK-11358](https://issues.apache.org/jira/browse/SPARK-11358):
+ In `spark.mllib.clustering.KMeans`, the `runs` parameter has been deprecated.
+* [SPARK-10592](https://issues.apache.org/jira/browse/SPARK-10592):
+ In `spark.ml.classification.LogisticRegressionModel` and
+ `spark.ml.regression.LinearRegressionModel`, the `weights` field has been deprecated in favor of
+ the new name `coefficients`.  This helps disambiguate from instance (row) "weights" given to
+ algorithms.
+
+Changes of behavior:
+
+* [SPARK-7770](https://issues.apache.org/jira/browse/SPARK-7770):
+ `spark.mllib.tree.GradientBoostedTrees`: `validationTol` has changed semantics in 1.6.
+ Previously, it was a threshold for absolute change in error. Now, it resembles the behavior of
+ `GradientDescent`'s `convergenceTol`: For large errors, it uses relative error (relative to the
+ previous error); for small errors (`< 0.01`), it uses absolute error.
+* [SPARK-11069](https://issues.apache.org/jira/browse/SPARK-11069):
+ `spark.ml.feature.RegexTokenizer`: Previously, it did not convert strings to lowercase before
+ tokenizing. Now, it converts to lowercase by default, with an option not to. This matches the
+ behavior of the simpler `Tokenizer` transformer.
+
+## From 1.4 to 1.5
+
+In the `spark.mllib` package, there are no breaking API changes but several behavior changes:
+
+* [SPARK-9005](https://issues.apache.org/jira/browse/SPARK-9005):
+  `RegressionMetrics.explainedVariance` returns the average regression sum of squares.
+* [SPARK-8600](https://issues.apache.org/jira/browse/SPARK-8600): `NaiveBayesModel.labels` are now
+  sorted.
+* [SPARK-3382](https://issues.apache.org/jira/browse/SPARK-3382): `GradientDescent` has a default
+  convergence tolerance `1e-3`, and hence iterations might end earlier than in 1.4.
+
+In the `spark.ml` package, there exists one breaking API change and one behavior change:
+
+* [SPARK-9268](https://issues.apache.org/jira/browse/SPARK-9268): Java's varargs support is removed
+  from `Params.setDefault` due to a
+  [Scala compiler bug](https://issues.scala-lang.org/browse/SI-9013).
+* [SPARK-10097](https://issues.apache.org/jira/browse/SPARK-10097): `Evaluator.isLargerBetter` is
+  added to indicate metric ordering. Metrics like RMSE no longer flip signs as in 1.4.
+
+## From 1.3 to 1.4
+
+In the `spark.mllib` package, there were several breaking changes, but all in `DeveloperApi` or `Experimental` APIs:
+
+* Gradient-Boosted Trees
+    * *(Breaking change)* The signature of the [`Loss.gradient`](api/scala/index.html#org.apache.spark.mllib.tree.loss.Loss) method was changed.  This is only an issue for users who wrote their own losses for GBTs.
+    * *(Breaking change)* The `apply` and `copy` methods for the case class [`BoostingStrategy`](api/scala/index.html#org.apache.spark.mllib.tree.configuration.BoostingStrategy) have been changed because of a modification to the case class fields.  This could be an issue for users who use `BoostingStrategy` to set GBT parameters.
+* *(Breaking change)* The return value of [`LDA.run`](api/scala/index.html#org.apache.spark.mllib.clustering.LDA) has changed.  It now returns an abstract class `LDAModel` instead of the concrete class `DistributedLDAModel`.  The object of type `LDAModel` can still be cast to the appropriate concrete type, which depends on the optimization algorithm.
+
+In the `spark.ml` package, several major API changes occurred, including:
+
+* `Param` and other APIs for specifying parameters
+* `uid` unique IDs for Pipeline components
+* Reorganization of certain classes
+
+Since the `spark.ml` API was an alpha component in Spark 1.3, we do not list all changes here.
+However, since 1.4 `spark.ml` is no longer an alpha component, we will provide details on any API
+changes for future releases.
+
+## From 1.2 to 1.3
+
+In the `spark.mllib` package, there were several breaking changes.  The first change (in `ALS`) is the only one in a component not marked as Alpha or Experimental.
+
+* *(Breaking change)* In [`ALS`](api/scala/index.html#org.apache.spark.mllib.recommendation.ALS), the extraneous method `solveLeastSquares` has been removed.  The `DeveloperApi` method `analyzeBlocks` was also removed.
+* *(Breaking change)* [`StandardScalerModel`](api/scala/index.html#org.apache.spark.mllib.feature.StandardScalerModel) remains an Alpha component. In it, the `variance` method has been replaced with the `std` method.  To compute the column variance values returned by the original `variance` method, simply square the standard deviation values returned by `std`.
+* *(Breaking change)* [`StreamingLinearRegressionWithSGD`](api/scala/index.html#org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD) remains an Experimental component.  In it, there were two changes:
+    * The constructor taking arguments was removed in favor of a builder pattern using the default constructor plus parameter setter methods.
+    * Variable `model` is no longer public.
+* *(Breaking change)* [`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree) remains an Experimental component.  In it and its associated classes, there were several changes:
+    * In `DecisionTree`, the deprecated class method `train` has been removed.  (The object/static `train` methods remain.)
+    * In `Strategy`, the `checkpointDir` parameter has been removed.  Checkpointing is still supported, but the checkpoint directory must be set before calling tree and tree ensemble training.
+* `PythonMLlibAPI` (the interface between Scala/Java and Python for MLlib) was a public API but is now private, declared `private[python]`.  This was never meant for external use.
+* In linear regression (including Lasso and ridge regression), the squared loss is now divided by 2.
+  So in order to produce the same result as in 1.2, the regularization parameter needs to be divided by 2 and the step size needs to be multiplied by 2.
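To see why those two adjustments cancel out, consider a single gradient step on a one-dimensional least-squares problem. This is a hedged toy sketch, not Spark code; the objective form and all names here are illustrative assumptions.

```python
# Why halving the squared loss requires halving the regularization parameter
# and doubling the step size: the new gradient is exactly half the old one,
# so a doubled step reproduces the same update. Toy 1-D example, not Spark.

def grad_v12(w, x, y, reg):
    # d/dw [ (x*w - y)^2 + reg * w^2 ]  -- 1.2-style full squared loss
    return 2 * x * (x * w - y) + 2 * reg * w

def grad_v13(w, x, y, reg):
    # d/dw [ (x*w - y)^2 / 2 + reg * w^2 ]  -- squared loss divided by 2
    return x * (x * w - y) + 2 * reg * w

w, x, y, step, reg = 0.3, 2.0, 1.0, 0.1, 0.5
w_12 = w - step * grad_v12(w, x, y, reg)            # 1.2 behavior
w_13 = w - (2 * step) * grad_v13(w, x, y, reg / 2)  # reg / 2, step * 2
assert abs(w_12 - w_13) < 1e-12                     # identical update
```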
+
+In the `spark.ml` package, the main API changes are from Spark SQL.  We list the most important changes here:
+
+* The old [SchemaRDD](http://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.sql.SchemaRDD) has been replaced with [DataFrame](api/scala/index.html#org.apache.spark.sql.DataFrame) with a somewhat modified API.  All algorithms in `spark.ml` which used to use SchemaRDD now use DataFrame.
+* In Spark 1.2, we used implicit conversions from `RDD`s of `LabeledPoint` into `SchemaRDD`s by calling `import sqlContext._` where `sqlContext` was an instance of `SQLContext`.  These implicits have been moved, so we now call `import sqlContext.implicits._`.
+* Java APIs for SQL have also changed accordingly.  Please see the examples above and the [Spark SQL Programming Guide](sql-programming-guide.html) for details.
+
+Other changes were in `LogisticRegression`:
+
+* The `scoreCol` output column (with default value "score") was renamed to be `probabilityCol` (with default value "probability").  The type was originally `Double` (for the probability of class 1.0), but it is now `Vector` (for the probability of each class, to support multiclass classification in the future).
+* In Spark 1.2, `LogisticRegressionModel` did not include an intercept.  In Spark 1.3, it includes an intercept; however, it will always be 0.0 since it uses the default settings for [spark.mllib.LogisticRegressionWithLBFGS](api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS).  The option to use an intercept will be added in the future.
+
+## From 1.1 to 1.2
+
+The only API changes in MLlib v1.2 are in
+[`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree),
+which continues to be an experimental API in MLlib 1.2:
+
+1. *(Breaking change)* The Scala API for classification takes a named argument specifying the number
+of classes.  In MLlib v1.1, this argument was called `numClasses` in Python and
+`numClassesForClassification` in Scala.  In MLlib v1.2, the names are both set to `numClasses`.
+This `numClasses` parameter is specified either via
+[`Strategy`](api/scala/index.html#org.apache.spark.mllib.tree.configuration.Strategy)
+or via [`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree)
+static `trainClassifier` and `trainRegressor` methods.
+
+2. *(Breaking change)* The API for
+[`Node`](api/scala/index.html#org.apache.spark.mllib.tree.model.Node) has changed.
+This should generally not affect user code, unless the user manually constructs decision trees
+(instead of using the `trainClassifier` or `trainRegressor` methods).
+The tree `Node` now includes more information, including the probability of the predicted label
+(for classification).
+
+3. Printing methods' output has changed.  The `toString` (Scala/Java) and `__repr__` (Python) methods used to print the full model; they now print a summary.  For the full model, use `toDebugString`.
+
+Examples in the Spark distribution and examples in the
+[Decision Trees Guide](mllib-decision-tree.html#examples) have been updated accordingly.
+
+## From 1.0 to 1.1
+
+The only API changes in MLlib v1.1 are in
+[`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree),
+which continues to be an experimental API in MLlib 1.1:
+
+1. *(Breaking change)* The meaning of tree depth has been changed by 1 in order to match
+the implementations of trees in
+[scikit-learn](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.tree)
+and in [rpart](http://cran.r-project.org/web/packages/rpart/index.html).
+In MLlib v1.0, a depth-1 tree had 1 leaf node, and a depth-2 tree had 1 root node and 2 leaf nodes.
+In MLlib v1.1, a depth-0 tree has 1 leaf node, and a depth-1 tree has 1 root node and 2 leaf nodes.
+This depth is specified by the `maxDepth` parameter in
+[`Strategy`](api/scala/index.html#org.apache.spark.mllib.tree.configuration.Strategy)
+or via [`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree)
+static `trainClassifier` and `trainRegressor` methods.
+
+2. *(Non-breaking change)* We recommend using the newly added `trainClassifier` and `trainRegressor`
+methods to build a [`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree),
+rather than using the old parameter class `Strategy`.  These new training methods explicitly
+separate classification and regression, and they replace specialized parameter types with
+simple `String` types.
+
+Examples of the new, recommended `trainClassifier` and `trainRegressor` are given in the
+[Decision Trees Guide](mllib-decision-tree.html#examples).
+
+## From 0.9 to 1.0
+
+In MLlib v1.0, we support both dense and sparse input in a unified way, which introduces a few
+breaking changes.  If your data is sparse, please store it in a sparse format instead of dense to
+take advantage of sparsity in both storage and computation. Details are described below.
+

http://git-wip-us.apache.org/repos/asf/spark/blob/90686abb/docs/ml-pipeline.md
----------------------------------------------------------------------
diff --git a/docs/ml-pipeline.md b/docs/ml-pipeline.md
new file mode 100644
index 0000000..adb057b
--- /dev/null
+++ b/docs/ml-pipeline.md
@@ -0,0 +1,245 @@
+---
+layout: global
+title: ML Pipelines
+displayTitle: ML Pipelines
+---
+
+`\[
+\newcommand{\R}{\mathbb{R}}
+\newcommand{\E}{\mathbb{E}}
+\newcommand{\x}{\mathbf{x}}
+\newcommand{\y}{\mathbf{y}}
+\newcommand{\wv}{\mathbf{w}}
+\newcommand{\av}{\mathbf{\alpha}}
+\newcommand{\bv}{\mathbf{b}}
+\newcommand{\N}{\mathbb{N}}
+\newcommand{\id}{\mathbf{I}}
+\newcommand{\ind}{\mathbf{1}}
+\newcommand{\0}{\mathbf{0}}
+\newcommand{\unit}{\mathbf{e}}
+\newcommand{\one}{\mathbf{1}}
+\newcommand{\zero}{\mathbf{0}}
+\]`
+
+In this section, we introduce the concept of ***ML Pipelines***.
+ML Pipelines provide a uniform set of high-level APIs built on top of
+[DataFrames](sql-programming-guide.html) that help users create and tune practical
+machine learning pipelines.
+
+**Table of Contents**
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+# Main concepts in Pipelines
+
+MLlib standardizes APIs for machine learning algorithms to make it easier to combine multiple
+algorithms into a single pipeline, or workflow.
+This section covers the key concepts introduced by the Pipelines API, where the pipeline concept is
+mostly inspired by the [scikit-learn](http://scikit-learn.org/) project.
+
+* **[`DataFrame`](ml-guide.html#dataframe)**: This ML API uses `DataFrame` from Spark SQL as an ML
+  dataset, which can hold a variety of data types.
+  E.g., a `DataFrame` could have different columns storing text, feature vectors, true labels, and predictions.
+
+* **[`Transformer`](ml-guide.html#transformers)**: A `Transformer` is an algorithm which can transform one `DataFrame` into another `DataFrame`.
+E.g., an ML model is a `Transformer` which transforms a `DataFrame` with features into a `DataFrame` with predictions.
+
+* **[`Estimator`](ml-guide.html#estimators)**: An `Estimator` is an algorithm which can be fit on a `DataFrame` to produce a `Transformer`.
+E.g., a learning algorithm is an `Estimator` which trains on a `DataFrame` and produces a model.
+
+* **[`Pipeline`](ml-guide.html#pipeline)**: A `Pipeline` chains multiple `Transformer`s and `Estimator`s together to specify an ML workflow.
+
+* **[`Parameter`](ml-guide.html#parameters)**: All `Transformer`s and `Estimator`s now share a common API for specifying parameters.
+
+## DataFrame
+
+Machine learning can be applied to a wide variety of data types, such as vectors, text, images, and structured data.
+This API adopts the `DataFrame` from Spark SQL in order to support a variety of data types.
+
+`DataFrame` supports many basic and structured types; see the [Spark SQL datatype reference](sql-programming-guide.html#spark-sql-datatype-reference) for a list of supported types.
+In addition to the types listed in the Spark SQL guide, `DataFrame` can use ML [`Vector`](mllib-data-types.html#local-vector) types.
+
+A `DataFrame` can be created either implicitly or explicitly from a regular `RDD`.  See the code examples below and the [Spark SQL programming guide](sql-programming-guide.html) for examples.
+
+Columns in a `DataFrame` are named.  The code examples below use names such as "text," "features," and "label."
+
+## Pipeline components
+
+### Transformers
+
+A `Transformer` is an abstraction that includes feature transformers and learned models.
+Technically, a `Transformer` implements a method `transform()`, which converts one `DataFrame` into
+another, generally by appending one or more columns.
+For example:
+
+* A feature transformer might take a `DataFrame`, read a column (e.g., text), map it into a new
+  column (e.g., feature vectors), and output a new `DataFrame` with the mapped column appended.
+* A learning model might take a `DataFrame`, read the column containing feature vectors, predict the
+  label for each feature vector, and output a new `DataFrame` with predicted labels appended as a
+  column.
+
+### Estimators
+
+An `Estimator` abstracts the concept of a learning algorithm or any algorithm that fits or trains on
+data.
+Technically, an `Estimator` implements a method `fit()`, which accepts a `DataFrame` and produces a
+`Model`, which is a `Transformer`.
+For example, a learning algorithm such as `LogisticRegression` is an `Estimator`, and calling
+`fit()` trains a `LogisticRegressionModel`, which is a `Model` and hence a `Transformer`.
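The contract just described can be sketched without Spark at all. The following toy Python classes are hypothetical stand-ins (plain lists of dicts play the role of a `DataFrame`); real `spark.ml` components additionally carry schemas, params, and unique IDs.

```python
# A Spark-free sketch of the Transformer / Estimator contract:
# fit() consumes data and produces a Model, which is itself a Transformer.

class MeanModel:
    """A fitted Model, i.e. a Transformer: data in, data out."""
    def __init__(self, mean):
        self.mean = mean

    def transform(self, rows):
        # Append a "prediction" column; dicts stand in for DataFrame rows.
        return [dict(row, prediction=self.mean) for row in rows]

class MeanEstimator:
    """An Estimator: fit() trains on data and returns a Transformer."""
    def fit(self, rows):
        labels = [row["label"] for row in rows]
        return MeanModel(sum(labels) / len(labels))

model = MeanEstimator().fit([{"label": 1.0}, {"label": 3.0}])
print(model.transform([{"label": 0.0}]))
# -> [{'label': 0.0, 'prediction': 2.0}]
```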
+
+### Properties of pipeline components
+
+`Transformer.transform()`s and `Estimator.fit()`s are both stateless.  In the future, stateful algorithms may be supported via alternative concepts.
+
+Each instance of a `Transformer` or `Estimator` has a unique ID, which is useful in specifying parameters (discussed below).
+
+## Pipeline
+
+In machine learning, it is common to run a sequence of algorithms to process and learn from data.
+E.g., a simple text document processing workflow might include several stages:
+
+* Split each document's text into words.
+* Convert each document's words into a numerical feature vector.
+* Learn a prediction model using the feature vectors and labels.
+
+MLlib represents such a workflow as a `Pipeline`, which consists of a sequence of
+`PipelineStage`s (`Transformer`s and `Estimator`s) to be run in a specific order.
+We will use this simple workflow as a running example in this section.
+
+### How it works
+
+A `Pipeline` is specified as a sequence of stages, and each stage is either a `Transformer` or an `Estimator`.
+These stages are run in order, and the input `DataFrame` is transformed as it passes through each stage.
+For `Transformer` stages, the `transform()` method is called on the `DataFrame`.
+For `Estimator` stages, the `fit()` method is called to produce a `Transformer` (which becomes part of the `PipelineModel`, or fitted `Pipeline`), and that `Transformer`'s `transform()` method is called on the `DataFrame`.
+
+We illustrate this for the simple text document workflow.  The figure below is for the *training time* usage of a `Pipeline`.
+
+<p style="text-align: center;">
+  <img
+    src="img/ml-Pipeline.png"
+    title="ML Pipeline Example"
+    alt="ML Pipeline Example"
+    width="80%"
+  />
+</p>
+
+Above, the top row represents a `Pipeline` with three stages.
+The first two (`Tokenizer` and `HashingTF`) are `Transformer`s (blue), and the third (`LogisticRegression`) is an `Estimator` (red).
+The bottom row represents data flowing through the pipeline, where cylinders indicate `DataFrame`s.
+The `Pipeline.fit()` method is called on the original `DataFrame`, which has raw text documents and labels.
+The `Tokenizer.transform()` method splits the raw text documents into words, adding a new column with words to the `DataFrame`.
+The `HashingTF.transform()` method converts the words column into feature vectors, adding a new column with those vectors to the `DataFrame`.
+Now, since `LogisticRegression` is an `Estimator`, the `Pipeline` first calls `LogisticRegression.fit()` to produce a `LogisticRegressionModel`.
+If the `Pipeline` had more stages, it would call the `LogisticRegressionModel`'s `transform()`
+method on the `DataFrame` before passing the `DataFrame` to the next stage.
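The fitting logic described above can be sketched in plain Python. This is an illustration of the semantics, not Spark's implementation: transformer stages (plain functions here) are applied as-is, estimator stages are fitted, and each fitted model transforms the data before it reaches the next stage. The fitted stages together play the role of the `PipelineModel`.

```python
# Sketch of Pipeline.fit(): returns the list of fitted (transformer) stages.
def pipeline_fit(stages, df):
    fitted = []
    for stage in stages:
        if hasattr(stage, "fit"):          # Estimator stage
            model = stage.fit(df)
            fitted.append(model)
            df = model.transform(df)       # pass transformed data onward
        else:                              # Transformer stage
            fitted.append(stage)
            df = stage(df)
    return fitted

# Sketch of PipelineModel.transform(): every fitted stage is a transformer.
def pipeline_model_transform(fitted, df):
    for stage in fitted:
        df = stage.transform(df) if hasattr(stage, "transform") else stage(df)
    return df

# Toy stages (hypothetical, for illustration only).
def add_len(df):
    return [dict(r, length=len(r["text"])) for r in df]

class ZeroEstimator:
    def fit(self, df):
        class Model:
            def transform(self, df):
                return [dict(r, prediction=0.0) for r in df]
        return Model()

fitted = pipeline_fit([add_len, ZeroEstimator()], [{"text": "hi"}])
out = pipeline_model_transform(fitted, [{"text": "spark"}])
```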
+
+A `Pipeline` is an `Estimator`.
+Thus, after a `Pipeline`'s `fit()` method runs, it produces a `PipelineModel`, which is a
+`Transformer`.
+This `PipelineModel` is used at *test time*; the figure below illustrates this usage.
+
+<p style="text-align: center;">
+  <img
+    src="img/ml-PipelineModel.png"
+    title="ML PipelineModel Example"
+    alt="ML PipelineModel Example"
+    width="80%"
+  />
+</p>
+
+In the figure above, the `PipelineModel` has the same number of stages as the original `Pipeline`, but all `Estimator`s in the original `Pipeline` have become `Transformer`s.
+When the `PipelineModel`'s `transform()` method is called on a test dataset, the data are passed
+through the fitted pipeline in order.
+Each stage's `transform()` method updates the dataset and passes it to the next stage.
+
+`Pipeline`s and `PipelineModel`s help to ensure that training and test data go through identical feature processing steps.
+
+### Details
+
+*DAG `Pipeline`s*: A `Pipeline`'s stages are specified as an ordered array.  The examples given here are all for linear `Pipeline`s, i.e., `Pipeline`s in which each stage uses data produced by the previous stage.  It is possible to create non-linear `Pipeline`s as long as the data flow graph forms a Directed Acyclic Graph (DAG).  This graph is currently specified implicitly based on the input and output column names of each stage (generally specified as parameters).  If the `Pipeline` forms a DAG, then the stages must be specified in topological order.
+
+*Runtime checking*: Since `Pipeline`s can operate on `DataFrame`s with varied types, they cannot use
+compile-time type checking.
+`Pipeline`s and `PipelineModel`s instead do runtime checking before actually running the `Pipeline`.
+This type checking is done using the `DataFrame` *schema*, a description of the data types of columns in the `DataFrame`.
+
+*Unique Pipeline stages*: A `Pipeline`'s stages should be unique instances.  E.g., the same instance
+`myHashingTF` should not be inserted into the `Pipeline` twice since `Pipeline` stages must have
+unique IDs.  However, different instances `myHashingTF1` and `myHashingTF2` (both of type `HashingTF`)
+can be put into the same `Pipeline` since different instances will be created with different IDs.
+
+## Parameters
+
+MLlib `Estimator`s and `Transformer`s use a uniform API for specifying parameters.
+
+A `Param` is a named parameter with self-contained documentation.
+A `ParamMap` is a set of (parameter, value) pairs.
+
+There are two main ways to pass parameters to an algorithm:
+
+1. Set parameters for an instance.  E.g., if `lr` is an instance of `LogisticRegression`, one could
+   call `lr.setMaxIter(10)` to make `lr.fit()` use at most 10 iterations.
+   This API resembles the API used in the `spark.mllib` package.
+2. Pass a `ParamMap` to `fit()` or `transform()`.  Any parameters in the `ParamMap` will override parameters previously specified via setter methods.
+
+Parameters belong to specific instances of `Estimator`s and `Transformer`s.
+For example, if we have two `LogisticRegression` instances `lr1` and `lr2`, then we can build a `ParamMap` with both `maxIter` parameters specified: `ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20)`.
+This is useful if there are two algorithms with the `maxIter` parameter in a `Pipeline`.
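The override and per-instance semantics can be sketched in plain Python. The `LR` class below is hypothetical (not Spark's `LogisticRegression`), but it illustrates both points: a `ParamMap` passed to `fit()` takes precedence over a setter, and `ParamMap` entries are keyed per instance, so two instances can carry different values for the "same" parameter.

```python
# Sketch of parameter passing: setters vs. a per-instance ParamMap.
class LR:
    def __init__(self):
        self.max_iter = 100                      # default value

    def set_max_iter(self, n):                   # setter-style configuration
        self.max_iter = n
        return self

    def fit(self, df, param_map=None):
        # A ParamMap entry for this instance overrides the setter value.
        max_iter = (param_map or {}).get((id(self), "maxIter"), self.max_iter)
        return {"maxIter": max_iter}

lr1, lr2 = LR(), LR()
lr1.set_max_iter(10)                             # setter on lr1
pm = {(id(lr1), "maxIter"): 30, (id(lr2), "maxIter"): 20}
m1 = lr1.fit([], pm)                             # ParamMap overrides setter
m2 = lr2.fit([], pm)                             # same param, other instance
m3 = LR().set_max_iter(10).fit([])               # setter only, no ParamMap
```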
+
+## Saving and Loading Pipelines
+
+It is often worthwhile to save a model or a pipeline to disk for later use. In Spark 1.6, model import/export functionality was added to the Pipeline API. Most basic transformers are supported, as are some of the more basic ML models. Please refer to the algorithm's API documentation to see whether saving and loading are supported.
+
+# Code examples
+
+This section gives code examples illustrating the functionality discussed above.
+For more info, please refer to the API documentation
+([Scala](api/scala/index.html#org.apache.spark.ml.package),
+[Java](api/java/org/apache/spark/ml/package-summary.html),
+and [Python](api/python/pyspark.ml.html)).
+
+## Example: Estimator, Transformer, and Param
+
+This example covers the concepts of `Estimator`, `Transformer`, and `Param`.
+
+<div class="codetabs">
+
+<div data-lang="scala">
+{% include_example scala/org/apache/spark/examples/ml/EstimatorTransformerParamExample.scala %}
+</div>
+
+<div data-lang="java">
+{% include_example java/org/apache/spark/examples/ml/JavaEstimatorTransformerParamExample.java %}
+</div>
+
+<div data-lang="python">
+{% include_example python/ml/estimator_transformer_param_example.py %}
+</div>
+
+</div>
+
+## Example: Pipeline
+
+This example follows the simple text document `Pipeline` illustrated in the figures above.
+
+<div class="codetabs">
+
+<div data-lang="scala">
+{% include_example scala/org/apache/spark/examples/ml/PipelineExample.scala %}
+</div>
+
+<div data-lang="java">
+{% include_example java/org/apache/spark/examples/ml/JavaPipelineExample.java %}
+</div>
+
+<div data-lang="python">
+{% include_example python/ml/pipeline_example.py %}
+</div>
+
+</div>
+
+## Model selection (hyperparameter tuning)
+
+A big benefit of using ML Pipelines is hyperparameter optimization.  See the [ML Tuning Guide](ml-tuning.html) for more information on automatic model selection.

http://git-wip-us.apache.org/repos/asf/spark/blob/90686abb/docs/ml-survival-regression.md
----------------------------------------------------------------------
diff --git a/docs/ml-survival-regression.md b/docs/ml-survival-regression.md
index 856ceb2..efa3c21 100644
--- a/docs/ml-survival-regression.md
+++ b/docs/ml-survival-regression.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Survival Regression - spark.ml
-displayTitle: Survival Regression - spark.ml
+title: Survival Regression
+displayTitle: Survival Regression
 ---
 
   > This section has been moved into the

http://git-wip-us.apache.org/repos/asf/spark/blob/90686abb/docs/ml-tuning.md
----------------------------------------------------------------------
diff --git a/docs/ml-tuning.md b/docs/ml-tuning.md
new file mode 100644
index 0000000..2ca90c7
--- /dev/null
+++ b/docs/ml-tuning.md
@@ -0,0 +1,121 @@
+---
+layout: global
+title: "ML Tuning"
+displayTitle: "ML Tuning: model selection and hyperparameter tuning"
+---
+
+`\[
+\newcommand{\R}{\mathbb{R}}
+\newcommand{\E}{\mathbb{E}}
+\newcommand{\x}{\mathbf{x}}
+\newcommand{\y}{\mathbf{y}}
+\newcommand{\wv}{\mathbf{w}}
+\newcommand{\av}{\mathbf{\alpha}}
+\newcommand{\bv}{\mathbf{b}}
+\newcommand{\N}{\mathbb{N}}
+\newcommand{\id}{\mathbf{I}}
+\newcommand{\ind}{\mathbf{1}}
+\newcommand{\0}{\mathbf{0}}
+\newcommand{\unit}{\mathbf{e}}
+\newcommand{\one}{\mathbf{1}}
+\newcommand{\zero}{\mathbf{0}}
+\]`
+
+This section describes how to use MLlib's tooling for tuning ML algorithms and Pipelines.
+Built-in Cross-Validation and other tooling allow users to optimize hyperparameters in algorithms and Pipelines.
+
+**Table of contents**
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+# Model selection (a.k.a. hyperparameter tuning)
+
+An important task in ML is *model selection*, or using data to find the best model or parameters for a given task.  This is also called *tuning*.
+Tuning may be done for individual `Estimator`s such as `LogisticRegression`, or for entire `Pipeline`s which include multiple algorithms, featurization, and other steps.  Users can tune an entire `Pipeline` at once, rather than tuning each element in the `Pipeline` separately.
+
+MLlib supports model selection using tools such as [`CrossValidator`](api/scala/index.html#org.apache.spark.ml.tuning.CrossValidator) and [`TrainValidationSplit`](api/scala/index.html#org.apache.spark.ml.tuning.TrainValidationSplit).
+These tools require the following items:
+
+* [`Estimator`](api/scala/index.html#org.apache.spark.ml.Estimator): algorithm or `Pipeline` to tune
+* Set of `ParamMap`s: parameters to choose from, sometimes called a "parameter grid" to search over
+* [`Evaluator`](api/scala/index.html#org.apache.spark.ml.evaluation.Evaluator): metric to measure how well a fitted `Model` does on held-out test data
+
+At a high level, these model selection tools work as follows:
+
+* They split the input data into separate training and test datasets.
+* For each (training, test) pair, they iterate through the set of `ParamMap`s:
+  * For each `ParamMap`, they fit the `Estimator` using those parameters, get the fitted `Model`, and evaluate the `Model`'s performance using the `Evaluator`.
+* They select the `Model` produced by the best-performing set of parameters.
+
+The `Evaluator` can be a [`RegressionEvaluator`](api/scala/index.html#org.apache.spark.ml.evaluation.RegressionEvaluator)
+for regression problems, a [`BinaryClassificationEvaluator`](api/scala/index.html#org.apache.spark.ml.evaluation.BinaryClassificationEvaluator)
+for binary data, or a [`MulticlassClassificationEvaluator`](api/scala/index.html#org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator)
+for multiclass problems. The default metric used to choose the best `ParamMap` can be overridden by the `setMetricName`
+method in each of these evaluators.
+
+To help construct the parameter grid, users can use the [`ParamGridBuilder`](api/scala/index.html#org.apache.spark.ml.tuning.ParamGridBuilder) utility.
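A parameter grid is just the cross product of the candidate values for each parameter, which is what `ParamGridBuilder` constructs. The sketch below is a pure-Python illustration of that idea, not Spark's API; parameters are identified by plain string names for simplicity.

```python
# Sketch: build a parameter grid as the cross product of candidate values.
from itertools import product

def build_grid(param_values):
    """Return one dict ("ParamMap") per combination of candidate values."""
    names = list(param_values)
    return [dict(zip(names, combo))
            for combo in product(*(param_values[n] for n in names))]

grid = build_grid({"hashingTF.numFeatures": [10, 100, 1000],
                   "lr.regParam": [0.1, 0.01]})
# 3 candidate values x 2 candidate values = 6 ParamMaps in the grid.
```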
+
+# Cross-Validation
+
+`CrossValidator` begins by splitting the dataset into a set of *folds* which are used as separate training and test datasets. E.g., with `$k=3$` folds, `CrossValidator` will generate 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing.  To evaluate a particular `ParamMap`, `CrossValidator` computes the average evaluation metric for the 3 `Model`s produced by fitting the `Estimator` on the 3 different (training, test) dataset pairs.
+
+After identifying the best `ParamMap`, `CrossValidator` finally re-fits the `Estimator` using the best `ParamMap` and the entire dataset.
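The procedure just described (average the metric across folds for each `ParamMap`, pick the best, then re-fit on the entire dataset) can be sketched in plain Python. This is an illustration of the algorithm, not Spark's `CrossValidator`; `fit` and `evaluate` are placeholder callables standing in for the `Estimator` and `Evaluator`.

```python
# Sketch of k-fold cross-validation over a parameter grid.
def cross_validate(fit, evaluate, data, param_maps, k=3):
    n = len(data)
    folds = [data[i * n // k:(i + 1) * n // k] for i in range(k)]
    best_pm, best_score = None, float("-inf")
    for pm in param_maps:
        scores = []
        for i in range(k):
            # Train on all folds except fold i; evaluate on fold i.
            train = [row for j, fold in enumerate(folds) if j != i
                     for row in fold]
            model = fit(train, pm)
            scores.append(evaluate(model, folds[i]))
        avg = sum(scores) / k                 # average metric across folds
        if avg > best_score:
            best_pm, best_score = pm, avg
    return fit(data, best_pm), best_pm        # re-fit on the entire dataset

# Toy example: the "model" is a constant, and values closer to 5 score higher.
fit = lambda train, pm: pm["c"]
evaluate = lambda model, test: -abs(model - 5)
best_model, best_pm = cross_validate(fit, evaluate, list(range(6)),
                                     [{"c": 1}, {"c": 5}], k=3)
```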
+
+## Example: model selection via cross-validation
+
+The following example demonstrates using `CrossValidator` to select from a grid of parameters.
+
+Note that cross-validation over a grid of parameters is expensive.
+E.g., in the example below, the parameter grid has 3 values for `hashingTF.numFeatures` and 2 values for `lr.regParam`, and `CrossValidator` uses 2 folds.  This multiplies out to `$(3 \times 2) \times 2 = 12$` different models being trained.
+In realistic settings, it can be common to try many more parameters and use more folds (`$k=3$` and `$k=10$` are common).
+In other words, using `CrossValidator` can be very expensive.
+However, it is also a well-established method for choosing parameters which is more statistically sound than heuristic hand-tuning.
+
+<div class="codetabs">
+
+<div data-lang="scala">
+{% include_example scala/org/apache/spark/examples/ml/ModelSelectionViaCrossValidationExample.scala %}
+</div>
+
+<div data-lang="java">
+{% include_example java/org/apache/spark/examples/ml/JavaModelSelectionViaCrossValidationExample.java %}
+</div>
+
+<div data-lang="python">
+
+{% include_example python/ml/cross_validator.py %}
+</div>
+
+</div>
+
+# Train-Validation Split
+
+In addition to `CrossValidator`, Spark also offers `TrainValidationSplit` for hyperparameter tuning.
+`TrainValidationSplit` evaluates each combination of parameters only once, as opposed to k times in
+the case of `CrossValidator`. It is therefore less expensive,
+but will not produce results that are as reliable when the training dataset is not sufficiently large.
+
+Unlike `CrossValidator`, `TrainValidationSplit` creates a single (training, test) dataset pair.
+It splits the dataset into these two parts using the `trainRatio` parameter. For example, with `$trainRatio=0.75$`,
+`TrainValidationSplit` will generate a training and test dataset pair where 75% of the data is used for training and 25% for validation.
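The single split can be sketched in a few lines of plain Python. This is an illustration of the `trainRatio` semantics, not Spark's implementation (which also randomizes the split):

```python
# Sketch: a single train/validation split controlled by train_ratio.
def train_validation_split(data, train_ratio=0.75):
    cut = int(len(data) * train_ratio)
    return data[:cut], data[cut:]     # (training set, validation set)

train, validation = train_validation_split(list(range(100)), 0.75)
# 75 rows for training, 25 for validation.
```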
+
+Like `CrossValidator`, `TrainValidationSplit` finally fits the `Estimator` using the best `ParamMap` and the entire dataset.
+
+## Example: model selection via train-validation split
+
+<div class="codetabs">
+
+<div data-lang="scala" markdown="1">
+{% include_example scala/org/apache/spark/examples/ml/ModelSelectionViaTrainValidationSplitExample.scala %}
+</div>
+
+<div data-lang="java" markdown="1">
+{% include_example java/org/apache/spark/examples/ml/JavaModelSelectionViaTrainValidationSplitExample.java %}
+</div>
+
+<div data-lang="python">
+{% include_example python/ml/train_validation_split.py %}
+</div>
+
+</div>

http://git-wip-us.apache.org/repos/asf/spark/blob/90686abb/docs/mllib-classification-regression.md
----------------------------------------------------------------------
diff --git a/docs/mllib-classification-regression.md b/docs/mllib-classification-regression.md
index aaf8bd4..a7b90de 100644
--- a/docs/mllib-classification-regression.md
+++ b/docs/mllib-classification-regression.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Classification and Regression - spark.mllib
-displayTitle: Classification and Regression - spark.mllib
+title: Classification and Regression - RDD-based API
+displayTitle: Classification and Regression - RDD-based API
 ---
 
 The `spark.mllib` package supports various methods for 

http://git-wip-us.apache.org/repos/asf/spark/blob/90686abb/docs/mllib-clustering.md
----------------------------------------------------------------------
diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md
index 073927c..d5f6ae3 100644
--- a/docs/mllib-clustering.md
+++ b/docs/mllib-clustering.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Clustering - spark.mllib
-displayTitle: Clustering - spark.mllib
+title: Clustering - RDD-based API
+displayTitle: Clustering - RDD-based API
 ---
 
 [Clustering](https://en.wikipedia.org/wiki/Cluster_analysis) is an unsupervised learning problem whereby we aim to group subsets

http://git-wip-us.apache.org/repos/asf/spark/blob/90686abb/docs/mllib-collaborative-filtering.md
----------------------------------------------------------------------
diff --git a/docs/mllib-collaborative-filtering.md b/docs/mllib-collaborative-filtering.md
index 5c33292..0f891a0 100644
--- a/docs/mllib-collaborative-filtering.md
+++ b/docs/mllib-collaborative-filtering.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Collaborative Filtering - spark.mllib
-displayTitle: Collaborative Filtering - spark.mllib
+title: Collaborative Filtering - RDD-based API
+displayTitle: Collaborative Filtering - RDD-based API
 ---
 
 * Table of contents

http://git-wip-us.apache.org/repos/asf/spark/blob/90686abb/docs/mllib-data-types.md
----------------------------------------------------------------------
diff --git a/docs/mllib-data-types.md b/docs/mllib-data-types.md
index ef56aeb..7dd3c97 100644
--- a/docs/mllib-data-types.md
+++ b/docs/mllib-data-types.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Data Types - MLlib
-displayTitle: Data Types - MLlib
+title: Data Types - RDD-based API
+displayTitle: Data Types - RDD-based API
 ---
 
 * Table of contents

http://git-wip-us.apache.org/repos/asf/spark/blob/90686abb/docs/mllib-decision-tree.md
----------------------------------------------------------------------
diff --git a/docs/mllib-decision-tree.md b/docs/mllib-decision-tree.md
index 11f5de1..0e753b8 100644
--- a/docs/mllib-decision-tree.md
+++ b/docs/mllib-decision-tree.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Decision Trees - spark.mllib
-displayTitle: Decision Trees - spark.mllib
+title: Decision Trees - RDD-based API
+displayTitle: Decision Trees - RDD-based API
 ---
 
 * Table of contents

http://git-wip-us.apache.org/repos/asf/spark/blob/90686abb/docs/mllib-dimensionality-reduction.md
----------------------------------------------------------------------
diff --git a/docs/mllib-dimensionality-reduction.md b/docs/mllib-dimensionality-reduction.md
index cceddce..539cbc1 100644
--- a/docs/mllib-dimensionality-reduction.md
+++ b/docs/mllib-dimensionality-reduction.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Dimensionality Reduction - spark.mllib
-displayTitle: Dimensionality Reduction - spark.mllib
+title: Dimensionality Reduction - RDD-based API
+displayTitle: Dimensionality Reduction - RDD-based API
 ---
 
 * Table of contents

http://git-wip-us.apache.org/repos/asf/spark/blob/90686abb/docs/mllib-ensembles.md
----------------------------------------------------------------------
diff --git a/docs/mllib-ensembles.md b/docs/mllib-ensembles.md
index 5543262..e1984b6 100644
--- a/docs/mllib-ensembles.md
+++ b/docs/mllib-ensembles.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Ensembles - spark.mllib
-displayTitle: Ensembles - spark.mllib
+title: Ensembles - RDD-based API
+displayTitle: Ensembles - RDD-based API
 ---
 
 * Table of contents

http://git-wip-us.apache.org/repos/asf/spark/blob/90686abb/docs/mllib-evaluation-metrics.md
----------------------------------------------------------------------
diff --git a/docs/mllib-evaluation-metrics.md b/docs/mllib-evaluation-metrics.md
index c49bc4f..ac82f43 100644
--- a/docs/mllib-evaluation-metrics.md
+++ b/docs/mllib-evaluation-metrics.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Evaluation Metrics - spark.mllib
-displayTitle: Evaluation Metrics - spark.mllib
+title: Evaluation Metrics - RDD-based API
+displayTitle: Evaluation Metrics - RDD-based API
 ---
 
 * Table of contents

http://git-wip-us.apache.org/repos/asf/spark/blob/90686abb/docs/mllib-feature-extraction.md
----------------------------------------------------------------------
diff --git a/docs/mllib-feature-extraction.md b/docs/mllib-feature-extraction.md
index 67c033e..867be7f 100644
--- a/docs/mllib-feature-extraction.md
+++ b/docs/mllib-feature-extraction.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Feature Extraction and Transformation - spark.mllib
-displayTitle: Feature Extraction and Transformation - spark.mllib
+title: Feature Extraction and Transformation - RDD-based API
+displayTitle: Feature Extraction and Transformation - RDD-based API
 ---
 
 * Table of contents

http://git-wip-us.apache.org/repos/asf/spark/blob/90686abb/docs/mllib-frequent-pattern-mining.md
----------------------------------------------------------------------
diff --git a/docs/mllib-frequent-pattern-mining.md b/docs/mllib-frequent-pattern-mining.md
index a7b55dc..93e3f0b 100644
--- a/docs/mllib-frequent-pattern-mining.md
+++ b/docs/mllib-frequent-pattern-mining.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Frequent Pattern Mining - spark.mllib
-displayTitle: Frequent Pattern Mining - spark.mllib
+title: Frequent Pattern Mining - RDD-based API
+displayTitle: Frequent Pattern Mining - RDD-based API
 ---
 
 Mining frequent items, itemsets, subsequences, or other substructures is usually among the

