http://git-wip-us.apache.org/repos/asf/flink/blob/844c874b/docs/apis/batch/libs/ml/distance_metrics.md

diff --git a/docs/apis/batch/libs/ml/distance_metrics.md b/docs/apis/batch/libs/ml/distance_metrics.md
deleted file mode 100644
index 303de4a..0000000
--- a/docs/apis/batch/libs/ml/distance_metrics.md
+++ /dev/null
@@ -1,111 +0,0 @@
---
mathjax: include
title: Distance Metrics

# Sub navigation
sub-nav-group: batch
sub-nav-parent: flinkml
sub-nav-title: Distance Metrics
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

* This will be replaced by the TOC
{:toc}

## Description

Different metrics of distance are convenient for different types of analysis. FlinkML provides
built-in implementations for many standard distance metrics. You can create custom
distance metrics by implementing the `DistanceMetric` trait.

## Built-in Implementations

Currently, FlinkML supports the following metrics:

<table class="table table-bordered">
 <thead>
 <tr>
 <th class="text-left" style="width: 20%">Metric</th>
 <th class="text-center">Description</th>
 </tr>
 </thead>

 <tbody>
 <tr>
 <td><strong>Euclidean Distance</strong></td>
 <td>
 $$d(\x, \y) = \sqrt{\sum_{i=1}^n \left(x_i - y_i \right)^2}$$
 </td>
 </tr>
 <tr>
 <td><strong>Squared Euclidean Distance</strong></td>
 <td>
 $$d(\x, \y) = \sum_{i=1}^n \left(x_i - y_i \right)^2$$
 </td>
 </tr>
 <tr>
 <td><strong>Cosine Similarity</strong></td>
 <td>
 $$d(\x, \y) = 1 - \frac{\x^T \y}{\Vert \x \Vert \Vert \y \Vert}$$
 </td>
 </tr>
 <tr>
 <td><strong>Chebyshev Distance</strong></td>
 <td>
 $$d(\x, \y) = \max_{i}\left(\left \vert x_i - y_i \right\vert \right)$$
 </td>
 </tr>
 <tr>
 <td><strong>Manhattan Distance</strong></td>
 <td>
 $$d(\x, \y) = \sum_{i=1}^n \left\vert x_i - y_i \right\vert$$
 </td>
 </tr>
 <tr>
 <td><strong>Minkowski Distance</strong></td>
 <td>
 $$d(\x, \y) = \left( \sum_{i=1}^{n} \left\vert x_i - y_i \right\vert^p \right)^{\rfrac{1}{p}}$$
 </td>
 </tr>
 <tr>
 <td><strong>Tanimoto Distance</strong></td>
 <td>
 $$d(\x, \y) = 1 - \frac{\x^T\y}{\Vert \x \Vert^2 + \Vert \y \Vert^2 - \x^T\y}$$
 with $\x$ and $\y$ being bit vectors
 </td>
 </tr>
 </tbody>
 </table>

## Custom Implementation

You can create your own distance metric by implementing the `DistanceMetric` trait.

{% highlight scala %}
class MyDistance extends DistanceMetric {
 override def distance(a: Vector, b: Vector): Double = ... // your implementation of the distance metric
}

object MyDistance {
 def apply() = new MyDistance()
}

val myMetric = MyDistance()
{% endhighlight %}
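
For illustration, here is a minimal sketch of a complete custom metric, a hypothetical `MyManhattanDistance` written against the `size` and `apply(i)` accessors of FlinkML's `Vector`; it is an example, not part of the library.

{% highlight scala %}
import org.apache.flink.ml.math.Vector
import org.apache.flink.ml.metrics.distances.DistanceMetric

// Hypothetical custom metric: the Manhattan distance, implemented by hand.
class MyManhattanDistance extends DistanceMetric {
  override def distance(a: Vector, b: Vector): Double = {
    require(a.size == b.size, "Both vectors must have the same dimension.")
    // Sum of absolute coordinate-wise differences
    (0 until a.size).map(i => math.abs(a(i) - b(i))).sum
  }
}

object MyManhattanDistance {
  def apply() = new MyManhattanDistance()
}
{% endhighlight %}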
http://git-wip-us.apache.org/repos/asf/flink/blob/844c874b/docs/apis/batch/libs/ml/index.md

diff --git a/docs/apis/batch/libs/ml/index.md b/docs/apis/batch/libs/ml/index.md
deleted file mode 100644
index 39b3a02..0000000
--- a/docs/apis/batch/libs/ml/index.md
+++ /dev/null
@@ -1,151 +0,0 @@
---
title: "FlinkML - Machine Learning for Flink"
# Top navigation
top-nav-group: libs
top-nav-pos: 2
top-nav-title: Machine Learning
# Sub navigation
sub-nav-group: batch
sub-nav-id: flinkml
sub-nav-pos: 2
sub-nav-parent: libs
sub-nav-title: Machine Learning
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

FlinkML is the Machine Learning (ML) library for Flink. It is a new effort in the Flink community,
with a growing list of algorithms and contributors. With FlinkML we aim to provide
scalable ML algorithms, an intuitive API, and tools that help minimize glue code in end-to-end ML
systems. You can see more details about our goals and where the library is headed in our [vision
and roadmap here](https://cwiki.apache.org/confluence/display/FLINK/FlinkML%3A+Vision+and+Roadmap).

* This will be replaced by the TOC
{:toc}

## Supported Algorithms

FlinkML currently supports the following algorithms:

### Supervised Learning

* [SVM using Communication-efficient distributed dual coordinate ascent (CoCoA)](svm.html)
* [Multiple linear regression](multiple_linear_regression.html)
* [Optimization Framework](optimization.html)

### Unsupervised Learning

* [k-Nearest neighbors join](knn.html)

### Data Preprocessing

* [Polynomial Features](polynomial_features.html)
* [Standard Scaler](standard_scaler.html)
* [MinMax Scaler](min_max_scaler.html)

### Recommendation

* [Alternating Least Squares (ALS)](als.html)

### Utilities

* [Distance Metrics](distance_metrics.html)
* [Cross Validation](cross_validation.html)

## Getting Started

You can check out our [quickstart guide](quickstart.html) for a comprehensive getting started
example.

If you want to jump right in, you have to [set up a Flink program]({{ site.baseurl }}/apis/batch/index.html#linking-with-flink).
Next, you have to add the FlinkML dependency to the `pom.xml` of your project.

{% highlight xml %}
<dependency>
 <groupId>org.apache.flink</groupId>
 <artifactId>flink-ml{{ site.scala_version_suffix }}</artifactId>
 <version>{{ site.version }}</version>
</dependency>
{% endhighlight %}

Note that FlinkML is currently not part of the binary distribution.
See linking with it for cluster execution [here]({{site.baseurl}}/apis/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution).

Now you can start solving your analysis task.
The following code snippet shows how easy it is to train a multiple linear regression model.

{% highlight scala %}


// LabeledVector is a feature vector with a label (class or real value)
val trainingData: DataSet[LabeledVector] = ...
val testingData: DataSet[Vector] = ...

// Alternatively, a Splitter is used to break up a DataSet into training and testing data.
val dataSet: DataSet[LabeledVector] = ...
val trainTestData: TrainTestDataSet[LabeledVector] = Splitter.trainTestSplit(dataSet)
val trainingData: DataSet[LabeledVector] = trainTestData.training
val testingData: DataSet[Vector] = trainTestData.testing.map(lv => lv.vector)

val mlr = MultipleLinearRegression()
 .setStepsize(1.0)
 .setIterations(100)
 .setConvergenceThreshold(0.001)

mlr.fit(trainingData)

// The fitted model can now be used to make predictions
val predictions: DataSet[LabeledVector] = mlr.predict(testingData)
{% endhighlight %}

## Pipelines

A key concept of FlinkML is its [scikit-learn](http://scikit-learn.org) inspired pipelining mechanism.
It allows you to quickly build complex data analysis pipelines as they appear in every data scientist's daily work.
An in-depth description of FlinkML's pipelines and their internal workings can be found [here](pipelines.html).

The following example code shows how easy it is to set up an analysis pipeline with FlinkML.

{% highlight scala %}
val trainingData: DataSet[LabeledVector] = ...
val testingData: DataSet[Vector] = ...

val scaler = StandardScaler()
val polyFeatures = PolynomialFeatures().setDegree(3)
val mlr = MultipleLinearRegression()

// Construct pipeline of standard scaler, polynomial features and multiple linear regression
val pipeline = scaler.chainTransformer(polyFeatures).chainPredictor(mlr)

// Train pipeline
pipeline.fit(trainingData)

// Calculate predictions
val predictions: DataSet[LabeledVector] = pipeline.predict(testingData)
{% endhighlight %}

One can chain a `Transformer` to another `Transformer` or a set of chained `Transformers` by calling the method `chainTransformer`.
If one wants to chain a `Predictor` to a `Transformer` or a set of chained `Transformers`, one has to call the method `chainPredictor`.


## How to contribute

The Flink community welcomes all contributors who want to get involved in the development of Flink and its libraries.
In order to get quickly started with contributing to FlinkML, please read our official
[contribution guide]({{site.baseurl}}/libs/ml/contribution_guide.html).
http://git-wip-us.apache.org/repos/asf/flink/blob/844c874b/docs/apis/batch/libs/ml/knn.md

diff --git a/docs/apis/batch/libs/ml/knn.md b/docs/apis/batch/libs/ml/knn.md
deleted file mode 100644
index 294d333..0000000
--- a/docs/apis/batch/libs/ml/knn.md
+++ /dev/null
@@ -1,149 +0,0 @@
---
mathjax: include
htmlTitle: FlinkML - k-Nearest neighbors join
title: <a href="../ml">FlinkML</a> - k-Nearest neighbors join

# Sub navigation
sub-nav-group: batch
sub-nav-parent: flinkml
sub-nav-title: k-Nearest neighbors join
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

* This will be replaced by the TOC
{:toc}

## Description
Implements an exact k-nearest neighbors join algorithm. Given a training set $A$ and a testing set $B$, the algorithm returns

$$
KNNJ(A, B, k) = \{ \left( b, KNN(b, A, k) \right) \text{ where } b \in B \text{ and } KNN(b, A, k) \text{ are the } k \text{-nearest points to } b \text{ in } A \}
$$

The brute-force approach is to compute the distance between every training and testing point. To ease this brute-force computation, a quadtree is used to partition the training points. The quadtree scales well in the number of training points, though poorly in the spatial dimension. The algorithm will automatically choose whether or not to use the quadtree, though the user can override that decision by setting a parameter that forces the quadtree to be used or not.

## Operations

`KNN` is a `Predictor`.
As such, it supports the `fit` and `predict` operation.

### Fit

KNN is trained on a given set of `Vector`:

* `fit[T <: Vector]: DataSet[T] => Unit`

### Predict

KNN predicts for all subtypes of FlinkML's `Vector` the corresponding k-nearest training points:

* `predict[T <: Vector]: DataSet[T] => DataSet[(T, Array[Vector])]`, where the `(T, Array[Vector])` tuple
 corresponds to (test point, k-nearest training points)

## Parameters

The KNN implementation can be controlled by the following parameters:

 <table class="table table-bordered">
 <thead>
 <tr>
 <th class="text-left" style="width: 20%">Parameters</th>
 <th class="text-center">Description</th>
 </tr>
 </thead>

 <tbody>
 <tr>
 <td><strong>K</strong></td>
 <td>
 <p>
 Defines the number of nearest neighbors to search for. That is, for each test point, the algorithm finds the k-nearest neighbors in the training set
 (Default value: <strong>5</strong>)
 </p>
 </td>
 </tr>
 <tr>
 <td><strong>DistanceMetric</strong></td>
 <td>
 <p>
 Sets the distance metric we use to calculate the distance between two points. If no metric is specified, then <code>EuclideanDistanceMetric</code> is used.
 (Default value: <strong>EuclideanDistanceMetric</strong>)
 </p>
 </td>
 </tr>
 <tr>
 <td><strong>Blocks</strong></td>
 <td>
 <p>
 Sets the number of blocks into which the input data will be split. This number should be set
 at least to the degree of parallelism. If no value is specified, then the parallelism of the
 input <code>DataSet</code> is used as the number of blocks.
 (Default value: <strong>None</strong>)
 </p>
 </td>
 </tr>
 <tr>
 <td><strong>UseQuadTree</strong></td>
 <td>
 <p>
 A boolean variable that decides whether or not to use a quadtree to partition the training set, which potentially simplifies the KNN search. If no value is specified, the code will automatically decide whether or not to use a quadtree. Use of a quadtree scales well with the number of training and testing points, though poorly with the dimension.
 (Default value: <strong>None</strong>)
 </p>
 </td>
 </tr>
 <tr>
 <td><strong>SizeHint</strong></td>
 <td>
 <p>Specifies whether the training set or the test set is small, to optimize the cross product operation needed for the KNN search. If the training set is small this should be <code>CrossHint.FIRST_IS_SMALL</code>; if the test set is small, set it to <code>CrossHint.SECOND_IS_SMALL</code>.
 (Default value: <strong>None</strong>)
 </p>
 </td>
 </tr>
 </tbody>
 </table>

## Examples

{% highlight scala %}
import org.apache.flink.api.common.operators.base.CrossOperatorBase.CrossHint
import org.apache.flink.api.scala._
import org.apache.flink.ml.nn.KNN
import org.apache.flink.ml.math.Vector
import org.apache.flink.ml.metrics.distances.SquaredEuclideanDistanceMetric

val env = ExecutionEnvironment.getExecutionEnvironment

// prepare data
val trainingSet: DataSet[Vector] = ...
val testingSet: DataSet[Vector] = ...

val knn = KNN()
 .setK(3)
 .setBlocks(10)
 .setDistanceMetric(SquaredEuclideanDistanceMetric())
 .setUseQuadTree(false)
 .setSizeHint(CrossHint.SECOND_IS_SMALL)

// run knn join
knn.fit(trainingSet)
val result = knn.predict(testingSet).collect()
{% endhighlight %}

For more details on computing KNN with and without a quadtree, see this presentation: [http://danielblazevski.github.io/](http://danielblazevski.github.io/)
http://git-wip-us.apache.org/repos/asf/flink/blob/844c874b/docs/apis/batch/libs/ml/min_max_scaler.md

diff --git a/docs/apis/batch/libs/ml/min_max_scaler.md b/docs/apis/batch/libs/ml/min_max_scaler.md
deleted file mode 100644
index 2948a96..0000000
--- a/docs/apis/batch/libs/ml/min_max_scaler.md
+++ /dev/null
@@ -1,116 +0,0 @@
---
mathjax: include
title: MinMax Scaler

# Sub navigation
sub-nav-group: batch
sub-nav-parent: flinkml
sub-nav-title: MinMax Scaler
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

* This will be replaced by the TOC
{:toc}

## Description

 The MinMax scaler scales the given data set, so that all values will lie within a user-specified range [min, max].
 In case the user does not provide specific minimum and maximum values for the scaling range, the MinMax scaler transforms the features of the input data set to lie in the [0, 1] interval.
 Given a set of input data $x_1, x_2,... x_n$, with minimum value:

 $$x_{min} = min({x_1, x_2,..., x_n})$$

 and maximum value:

 $$x_{max} = max({x_1, x_2,..., x_n})$$

The scaled data set $z_1, z_2,...,z_n$ will be:

 $$z_{i}= \frac{x_{i} - x_{min}}{x_{max} - x_{min}} \left ( max - min \right ) + min$$

where $\textit{min}$ and $\textit{max}$ are the user-specified minimum and maximum values of the range to scale.
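
As a quick numeric illustration of the formula (plain Scala, not the FlinkML API): scaling the values $1, 2, 5$ into the range $[0, 10]$ yields $0, 2.5, 10$.

{% highlight scala %}
// Worked example of the scaling formula above on plain Scala collections.
val xs = Seq(1.0, 2.0, 5.0)
val (xMin, xMax) = (xs.min, xs.max)   // x_min = 1.0, x_max = 5.0
val (min, max) = (0.0, 10.0)          // user-specified target range

val zs = xs.map(x => (x - xMin) / (xMax - xMin) * (max - min) + min)
// zs == Seq(0.0, 2.5, 10.0)
{% endhighlight %}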

## Operations

`MinMaxScaler` is a `Transformer`.
As such, it supports the `fit` and `transform` operation.

### Fit

MinMaxScaler is trained on all subtypes of `Vector` or `LabeledVector`:

* `fit[T <: Vector]: DataSet[T] => Unit`
* `fit: DataSet[LabeledVector] => Unit`

### Transform

MinMaxScaler transforms all subtypes of `Vector` or `LabeledVector` into the respective type:

* `transform[T <: Vector]: DataSet[T] => DataSet[T]`
* `transform: DataSet[LabeledVector] => DataSet[LabeledVector]`

## Parameters

The MinMax scaler implementation can be controlled by the following two parameters:

 <table class="table table-bordered">
 <thead>
 <tr>
 <th class="text-left" style="width: 20%">Parameters</th>
 <th class="text-center">Description</th>
 </tr>
 </thead>

 <tbody>
 <tr>
 <td><strong>Min</strong></td>
 <td>
 <p>
 The minimum value of the range for the scaled data set. (Default value: <strong>0.0</strong>)
 </p>
 </td>
 </tr>
 <tr>
 <td><strong>Max</strong></td>
 <td>
 <p>
 The maximum value of the range for the scaled data set. (Default value: <strong>1.0</strong>)
 </p>
 </td>
 </tr>
 </tbody>
</table>

## Examples

{% highlight scala %}
// Create MinMax scaler transformer
val minMaxscaler = MinMaxScaler()
 .setMin(-1.0)

// Obtain data set to be scaled
val dataSet: DataSet[Vector] = ...

// Learn the minimum and maximum values of the training data
minMaxscaler.fit(dataSet)

// Scale the provided data set to have min=-1.0 and max=1.0
val scaledDS = minMaxscaler.transform(dataSet)
{% endhighlight %}
http://git-wip-us.apache.org/repos/asf/flink/blob/844c874b/docs/apis/batch/libs/ml/multiple_linear_regression.md

diff --git a/docs/apis/batch/libs/ml/multiple_linear_regression.md b/docs/apis/batch/libs/ml/multiple_linear_regression.md
deleted file mode 100644
index b427eac..0000000
--- a/docs/apis/batch/libs/ml/multiple_linear_regression.md
+++ /dev/null
@@ -1,164 +0,0 @@
---
mathjax: include
title: Multiple linear regression

# Sub navigation
sub-nav-group: batch
sub-nav-parent: flinkml
sub-nav-title: Multiple Linear Regression
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

* This will be replaced by the TOC
{:toc}

## Description

 Multiple linear regression tries to find a linear function which best fits the provided input data.
 Given a set of input data with its value $(\mathbf{x}, y)$, multiple linear regression finds
 a vector $\mathbf{w}$ such that the sum of the squared residuals is minimized:

 $$ S(\mathbf{w}) = \sum_{i=1}^n \left(y_i - \mathbf{w}^T\mathbf{x_i} \right)^2$$

 Written in matrix notation, we obtain the following formulation:

 $$\mathbf{w}^* = \arg \min_{\mathbf{w}} (\mathbf{y} - X\mathbf{w})^2$$

 This problem has a closed form solution which is given by:

 $$\mathbf{w}^* = \left(X^TX\right)^{-1}X^T\mathbf{y}$$

 However, in cases where the input data set is so huge that a complete parse over the whole data
 set is prohibitive, one can apply stochastic gradient descent (SGD) to approximate the solution.
 SGD first calculates the gradients for a random subset of the input data set. The gradient
 for a given point $\mathbf{x}_i$ is given by:

 $$\nabla_{\mathbf{w}} S(\mathbf{w}, \mathbf{x_i}) = 2\left(\mathbf{w}^T\mathbf{x_i} -
 y_i\right)\mathbf{x_i}$$

 The gradients are averaged and scaled. The scaling is defined by $\gamma = \frac{s}{\sqrt{j}}$
 with $s$ being the initial step size and $j$ being the current iteration number. The resulting gradient is subtracted from the
 current weight vector giving the new weight vector for the next iteration:

 $$\mathbf{w}_{t+1} = \mathbf{w}_t - \gamma \frac{1}{n}\sum_{i=1}^n \nabla_{\mathbf{w}} S(\mathbf{w}, \mathbf{x_i})$$

 The multiple linear regression algorithm computes either a fixed number of SGD iterations or terminates based on a dynamic convergence criterion.
 The convergence criterion is the relative change in the sum of squared residuals:

 $$\frac{S_{k-1} - S_k}{S_{k-1}} < \rho$$
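
The following plain-Scala sketch spells out one iteration of this update rule for dense arrays; it only illustrates the math above and is not the FlinkML implementation.

{% highlight scala %}
// One SGD iteration: w_{t+1} = w_t - gamma * (1/n) * sum_i grad_i,
// with gamma = s / sqrt(j) and grad_i = 2 * (w^T x_i - y_i) * x_i.
def sgdStep(w: Array[Double], xs: Array[Array[Double]], ys: Array[Double],
            s: Double, j: Int): Array[Double] = {
  val n = xs.length
  val gamma = s / math.sqrt(j)
  val avgGrad = Array.fill(w.length)(0.0)
  for ((x, y) <- xs.zip(ys)) {
    // error of the linear prediction w^T x_i against the label y_i
    val err = x.zip(w).map { case (xi, wi) => xi * wi }.sum - y
    for (k <- w.indices) avgGrad(k) += 2 * err * x(k) / n
  }
  w.zip(avgGrad).map { case (wk, gk) => wk - gamma * gk }
}
{% endhighlight %}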

## Operations

`MultipleLinearRegression` is a `Predictor`.
As such, it supports the `fit` and `predict` operation.

### Fit

MultipleLinearRegression is trained on a set of `LabeledVector`:

* `fit: DataSet[LabeledVector] => Unit`

### Predict

MultipleLinearRegression predicts for all subtypes of `Vector` the corresponding regression value:

* `predict[T <: Vector]: DataSet[T] => DataSet[LabeledVector]`

If we call predict with a `DataSet[LabeledVector]`, we make a prediction on the regression value
for each example, and return a `DataSet[(Double, Double)]`. In each tuple the first element
is the true value, as provided by the input `DataSet[LabeledVector]`, and the second element
is the predicted value. You can then use these `(truth, prediction)` tuples to evaluate
the algorithm's performance.

* `predict: DataSet[LabeledVector] => DataSet[(Double, Double)]`
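
For example, the `(truth, prediction)` tuples can be reduced to a mean squared error. The following sketch assumes a labeled test set `labeledTestingDS` (a hypothetical name) and the fitted `mlr` from the examples below.

{% highlight scala %}
val labeledTestingDS: DataSet[LabeledVector] = ...

// predict on labeled data yields (truth, prediction) tuples
val evaluationDS: DataSet[(Double, Double)] = mlr.predict(labeledTestingDS)

// mean squared error over all test examples
val mse = evaluationDS
  .map { case (truth, prediction) => (math.pow(truth - prediction, 2), 1) }
  .reduce((a, b) => (a._1 + b._1, a._2 + b._2))
  .map { case (sumSquares, count) => sumSquares / count }
{% endhighlight %}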

## Parameters

 The multiple linear regression implementation can be controlled by the following parameters:

 <table class="table table-bordered">
 <thead>
 <tr>
 <th class="text-left" style="width: 20%">Parameters</th>
 <th class="text-center">Description</th>
 </tr>
 </thead>

 <tbody>
 <tr>
 <td><strong>Iterations</strong></td>
 <td>
 <p>
 The maximum number of iterations. (Default value: <strong>10</strong>)
 </p>
 </td>
 </tr>
 <tr>
 <td><strong>Stepsize</strong></td>
 <td>
 <p>
 Initial step size for the gradient descent method.
 This value controls how far the gradient descent method moves in the opposite direction of the gradient.
 Tuning this parameter might be crucial to make the algorithm stable and to obtain better performance.
 (Default value: <strong>0.1</strong>)
 </p>
 </td>
 </tr>
 <tr>
 <td><strong>ConvergenceThreshold</strong></td>
 <td>
 <p>
 Threshold for relative change of the sum of squared residuals until the iteration is stopped.
 (Default value: <strong>None</strong>)
 </p>
 </td>
 </tr>
 <tr>
 <td><strong>LearningRateMethod</strong></td>
 <td>
 <p>
 Learning rate method used to calculate the effective learning rate for each iteration.
 See the list of supported <a href="optimization.html">learning rate methods</a>.
 (Default value: <strong>LearningRateMethod.Default</strong>)
 </p>
 </td>
 </tr>
 </tbody>
 </table>

## Examples

{% highlight scala %}
// Create multiple linear regression learner
val mlr = MultipleLinearRegression()
  .setIterations(10)
  .setStepsize(0.5)
  .setConvergenceThreshold(0.001)

// Obtain training and testing data set
val trainingDS: DataSet[LabeledVector] = ...
val testingDS: DataSet[Vector] = ...

// Fit the linear model to the provided data
mlr.fit(trainingDS)

// Calculate the predictions for the test data
val predictions = mlr.predict(testingDS)
{% endhighlight %}
http://git-wip-us.apache.org/repos/asf/flink/blob/844c874b/docs/apis/batch/libs/ml/optimization.md

diff --git a/docs/apis/batch/libs/ml/optimization.md b/docs/apis/batch/libs/ml/optimization.md
deleted file mode 100644
index ccb7e45..0000000
--- a/docs/apis/batch/libs/ml/optimization.md
+++ /dev/null
@@ -1,385 +0,0 @@
---
mathjax: include
title: Optimization
# Sub navigation
sub-nav-group: batch
sub-nav-parent: flinkml
sub-nav-title: Optimization
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

* Table of contents
{:toc}

## Mathematical Formulation

The optimization framework in FlinkML is a developer-oriented package that can be used to solve
[optimization](https://en.wikipedia.org/wiki/Mathematical_optimization)
problems common in Machine Learning (ML) tasks. In the supervised learning context, this usually
involves finding a model, as defined by a set of parameters $\wv$, that minimizes a function $f(\wv)$
given a set of $(\x, y)$ examples,
where $\x$ is a feature vector and $y$ is a real number, which can represent either a real value in
the regression case, or a class label in the classification case. In supervised learning, the
function to be minimized is usually of the form:


\begin{equation} \label{eq:objectiveFunc}
 f(\wv) :=
 \frac1n \sum_{i=1}^n L(\wv;\x_i,y_i) +
 \lambda\, R(\wv)
 \ .
\end{equation}


where $L$ is the loss function and $R(\wv)$ the regularization penalty. We use $L$ to measure how
well the model fits the observed data, and we use $R$ in order to impose a complexity cost to the
model, with $\lambda > 0$ being the regularization parameter.

### Loss Functions

In supervised learning, we use loss functions in order to measure the model fit, by
penalizing errors in the predictions $p$ made by the model compared to the true $y$ for each
example. Different loss functions can be used for regression (e.g. Squared Loss) and classification
(e.g. Hinge Loss) tasks.

Some common loss functions are:

* Squared Loss: $ \frac{1}{2} \left(\wv^T \cdot \x - y\right)^2, \quad y \in \R $
* Hinge Loss: $ \max \left(0, 1 - y ~ \wv^T \cdot \x\right), \quad y \in \{-1, +1\} $
* Logistic Loss: $ \log\left(1+\exp\left(-y ~ \wv^T \cdot \x\right)\right), \quad y \in \{-1, +1\}$
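
For a single example with prediction $p = \wv^T \cdot \x$, these losses can be written down directly (plain Scala, shown only to make the formulas concrete):

{% highlight scala %}
def squaredLoss(p: Double, y: Double): Double  = 0.5 * math.pow(p - y, 2)
def hingeLoss(p: Double, y: Double): Double    = math.max(0.0, 1 - y * p)        // y in {-1, +1}
def logisticLoss(p: Double, y: Double): Double = math.log(1 + math.exp(-y * p))  // y in {-1, +1}
{% endhighlight %}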

### Regularization Types

[Regularization](https://en.wikipedia.org/wiki/Regularization_(mathematics)) in machine learning
imposes penalties on the estimated models, in order to reduce overfitting. The most common penalties
are the $L_1$ and $L_2$ penalties, defined as:

* $L_1$: $R(\wv) = \norm{\wv}_1$
* $L_2$: $R(\wv) = \frac{1}{2}\norm{\wv}_2^2$

The $L_2$ penalty penalizes large weights, favoring solutions with more small weights rather than
few large ones.
The $L_1$ penalty can be used to drive a number of the solution coefficients to 0, thereby
producing sparse solutions.
The regularization constant $\lambda$ in $\eqref{eq:objectiveFunc}$ determines the amount of regularization applied to the model,
and is usually determined through model crossvalidation.
A good comparison of regularization types can be found in [this](http://www.robotics.stanford.edu/~ang/papers/icml04l1l2.pdf) paper by Andrew Ng.
Which regularization types are supported depends on the optimization algorithm that is actually used.
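
Written out for a weight vector `w` (plain Scala, just to make the two penalties concrete):

{% highlight scala %}
def l1Penalty(w: Array[Double]): Double = w.map(math.abs).sum          // ||w||_1
def l2Penalty(w: Array[Double]): Double = 0.5 * w.map(x => x * x).sum  // 0.5 * ||w||_2^2
{% endhighlight %}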

## Stochastic Gradient Descent

In order to find a (local) minimum of a function, Gradient Descent methods take steps in the
direction opposite to the gradient of the function $\eqref{eq:objectiveFunc}$ taken with
respect to the current parameters (weights).
In order to compute the exact gradient we need to perform one pass through all the points in
a dataset, making the process computationally expensive.
An alternative is Stochastic Gradient Descent (SGD) where at each iteration we sample one point
from the complete dataset and update the parameters for each point, in an online manner.

In mini-batch SGD we instead sample random subsets of the dataset, and compute the gradient
over each batch. At each iteration of the algorithm we update the weights once, based on
the average of the gradients computed from each mini-batch.

An important parameter is the learning rate $\eta$, or step size, which can be determined by one of five methods, listed below. The setting of the initial step size can significantly affect the performance of the
algorithm. For some practical tips on tuning SGD see Leon Bottou's
"[Stochastic Gradient Descent Tricks](http://research.microsoft.com/pubs/192769/tricks2012.pdf)".

The current implementation of SGD uses the whole partition, making it
effectively a batch gradient descent. Once a sampling operator has been introduced in Flink, true
mini-batch SGD will be performed.

### Regularization

FlinkML supports Stochastic Gradient Descent with L1, L2 and no regularization.
The following list contains a mapping between the implementing classes and the regularization function.

<table class="table table-bordered">
 <thead>
 <tr>
 <th class="text-left" style="width: 20%">Class Name</th>
 <th class="text-center">Regularization function $R(\wv)$</th>
 </tr>
 </thead>
 <tbody>
 <tr>
 <td><code>SimpleGradient</code></td>
 <td>$R(\wv) = 0$</td>
 </tr>
 <tr>
 <td><code>GradientDescentL1</code></td>
 <td>$R(\wv) = \norm{\wv}_1$</td>
 </tr>
 <tr>
 <td><code>GradientDescentL2</code></td>
 <td>$R(\wv) = \frac{1}{2}\norm{\wv}_2^2$</td>
 </tr>
 </tbody>
</table>

### Parameters

 The stochastic gradient descent implementation can be controlled by the following parameters:

 <table class="table table-bordered">
 <thead>
 <tr>
 <th class="text-left" style="width: 20%">Parameter</th>
 <th class="text-center">Description</th>
 </tr>
 </thead>
 <tbody>
 <tr>
 <td><strong>LossFunction</strong></td>
 <td>
 <p>
 The loss function to be optimized. (Default value: <strong>None</strong>)
 </p>
 </td>
 </tr>
 <tr>
 <td><strong>RegularizationConstant</strong></td>
 <td>
 <p>
 The amount of regularization to apply. (Default value: <strong>0.1</strong>)
 </p>
 </td>
 </tr>
 <tr>
 <td><strong>Iterations</strong></td>
 <td>
 <p>
 The maximum number of iterations. (Default value: <strong>10</strong>)
 </p>
 </td>
 </tr>
 <tr>
 <td><strong>LearningRate</strong></td>
 <td>
 <p>
 Initial learning rate for the gradient descent method.
 This value controls how far the gradient descent method moves in the opposite direction
 of the gradient.
 (Default value: <strong>0.1</strong>)
 </p>
 </td>
 </tr>
 <tr>
 <td><strong>ConvergenceThreshold</strong></td>
 <td>
 <p>
 When set, iterations stop if the relative change in the value of the objective function $\eqref{eq:objectiveFunc}$ is less than the provided threshold, $\tau$.
 The convergence criterion is defined as follows: $\left| \frac{f(\wv)_{i-1} - f(\wv)_i}{f(\wv)_{i-1}} \right| < \tau$.
 (Default value: <strong>None</strong>)
 </p>
 </td>
 </tr>
 <tr>
 <td><strong>LearningRateMethod</strong></td>
 <td>
 <p>
 (Default value: <strong>LearningRateMethod.Default</strong>)
 </p>
 </td>
 </tr>
 <tr>
 <td><strong>Decay</strong></td>
 <td>
 <p>
 (Default value: <strong>0.0</strong>)
 </p>
 </td>
 </tr>
 </tbody>
 </table>

### Loss Function

The loss function which is minimized has to implement the `LossFunction` interface, which defines methods to compute the loss and its gradient.
One can either define one's own `LossFunction` or use the `GenericLossFunction` class, which constructs the loss function from an outer loss function and a prediction function.
An example can be seen here:

{% highlight scala %}
val lossFunction = GenericLossFunction(SquaredLoss, LinearPrediction)
{% endhighlight %}

The full list of supported outer loss functions can be found [here](#partial-loss-function-values).
The full list of supported prediction functions can be found [here](#prediction-function-values).

#### Partial Loss Function Values ##

 <table class="table table-bordered">
 <thead>
 <tr>
 <th class="text-left" style="width: 20%">Function Name</th>
 <th class="text-center">Description</th>
 <th class="text-center">Loss</th>
 <th class="text-center">Loss Derivative</th>
 </tr>
 </thead>
 <tbody>
 <tr>
 <td><strong>SquaredLoss</strong></td>
 <td>
 <p>
 Loss function most commonly used for regression tasks.
 </p>
 </td>
 <td class="text-center">$\frac{1}{2} (\wv^T \cdot \x - y)^2$</td>
 <td class="text-center">$\wv^T \cdot \x - y$</td>
 </tr>
 </tbody>
 </table>

#### Prediction Function Values ##

 <table class="table table-bordered">
 <thead>
 <tr>
 <th class="text-left" style="width: 20%">Function Name</th>
 <th class="text-center">Description</th>
 <th class="text-center">Prediction</th>
 <th class="text-center">Prediction Gradient</th>
 </tr>
 </thead>
 <tbody>
 <tr>
 <td><strong>LinearPrediction</strong></td>
 <td>
 <p>
 The function most commonly used for linear models, such as linear regression and
 linear classifiers.
 </p>
 </td>
 <td class="text-center">$\x^T \cdot \wv$</td>
 <td class="text-center">$\x$</td>
 </tr>
 </tbody>
 </table>

#### Effective Learning Rate ##

Where:

- $j$ is the iteration number
- $\eta_j$ is the step size on step $j$
- $\eta_0$ is the initial step size
- $\lambda$ is the regularization constant
- $\tau$ is the decay constant, which causes the learning rate to be a decreasing function of $j$, that is to say, as iterations increase, the learning rate decreases. The exact rate of decay is function specific; see **Inverse Scaling** and **Wei Xu's Method** (which is an extension of the **Inverse Scaling** method).

<table class="table table-bordered">
 <thead>
 <tr>
 <th class="text-left" style="width: 20%">Function Name</th>
 <th class="text-center">Description</th>
 <th class="text-center">Function</th>
 <th class="text-center">Called As</th>
 </tr>
 </thead>
 <tbody>
 <tr>
 <td><strong>Default</strong></td>
 <td>
 <p>
 The default method used for determining the step size. This is equivalent to the inverse scaling method for $\tau$ = 0.5. This special case is kept as the default to maintain backwards compatibility.
 </p>
 </td>
 <td class="text-center">$\eta_j = \eta_0/\sqrt{j}$</td>
 <td class="text-center"><code>LearningRateMethod.Default</code></td>
 </tr>
 <tr>
 <td><strong>Constant</strong></td>
 <td>
 <p>
 The step size is constant throughout the learning task.
 </p>
 </td>
 <td class="text-center">$\eta_j = \eta_0$</td>
 <td class="text-center"><code>LearningRateMethod.Constant</code></td>
 </tr>
 <tr>
 <td><strong>Leon Bottou's Method</strong></td>
 <td>
 <p>
 This is the <code>'optimal'</code> method of sklearn.
 The optimal initial value $t_0$ has to be provided.
 Sklearn uses the following heuristic: $t_0 = \max(1.0, L^\prime(-\beta, 1.0)) / (\alpha \cdot \beta)$
 with $\beta = \sqrt{\frac{1}{\sqrt{\alpha}}}$ and $L^\prime(prediction, truth)$ being the derivative of the loss function.
 </p>
 </td>
 <td class="text-center">$\eta_j = 1 / (\lambda \cdot (t_0 + j - 1)) $</td>
 <td class="text-center"><code>LearningRateMethod.Bottou</code></td>
 </tr>
 <tr>
 <td><strong>Inverse Scaling</strong></td>
 <td>
 <p>
 A very common method for determining the step size.
 </p>
 </td>
 <td class="text-center">$\eta_j = \eta_0 / j^{\tau}$</td>
 <td class="text-center"><code>LearningRateMethod.InvScaling</code></td>
 </tr>
 <tr>
 <td><strong>Wei Xu's Method</strong></td>
 <td>
 <p>
 Method proposed by Wei Xu in <a href="http://arxiv.org/pdf/1107.2490.pdf">Towards Optimal One Pass Large Scale Learning with
 Averaged Stochastic Gradient Descent</a>
 </p>
 </td>
 <td class="text-center">$\eta_j = \eta_0 \cdot (1 + \lambda \cdot \eta_0 \cdot j)^{-\tau} $</td>
 <td class="text-center"><code>LearningRateMethod.Xu</code></td>
 </tr>
 </tbody>
 </table>
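
The schedules in the table translate directly into code. The following plain-Scala sketch (assumed names, not the FlinkML implementation) uses `eta0` for the initial step size $\eta_0$, `lambda` for $\lambda$, `tau` for $\tau$, and the 1-based iteration number `j`:

{% highlight scala %}
def default(eta0: Double, j: Int): Double                 = eta0 / math.sqrt(j)
def constant(eta0: Double): Double                        = eta0
def bottou(lambda: Double, t0: Double, j: Int): Double    = 1.0 / (lambda * (t0 + j - 1))
def invScaling(eta0: Double, tau: Double, j: Int): Double = eta0 / math.pow(j, tau)
def xu(eta0: Double, lambda: Double, tau: Double, j: Int): Double =
  eta0 * math.pow(1 + lambda * eta0 * j, -tau)
{% endhighlight %}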

### Examples

In the Flink implementation of SGD, given a set of examples in a `DataSet[LabeledVector]` and
optionally some initial weights, we can use `GradientDescentL1.optimize()` in order to optimize
the weights for the given data.

The user can provide an initial `DataSet[WeightVector]`,
which contains one `WeightVector` element, or use the default weights which are all set to 0.
A `WeightVector` is a container class for the weights, which separates the intercept from the
weight vector. This allows us to avoid applying regularization to the intercept.



{% highlight scala %}
// Create stochastic gradient descent solver
val sgd = GradientDescentL1()
 .setLossFunction(SquaredLoss())
 .setRegularizationConstant(0.2)
 .setIterations(100)
 .setLearningRate(0.01)
 .setLearningRateMethod(LearningRateMethod.Xu(0.75))


// Obtain data
val trainingDS: DataSet[LabeledVector] = ...

// Optimize the weights, according to the provided data
val weightDS = sgd.optimize(trainingDS)
{% endhighlight %}
http://git-wip-us.apache.org/repos/asf/flink/blob/844c874b/docs/apis/batch/libs/ml/pipelines.md

diff --git a/docs/apis/batch/libs/ml/pipelines.md b/docs/apis/batch/libs/ml/pipelines.md
deleted file mode 100644
index f86476c..0000000
--- a/docs/apis/batch/libs/ml/pipelines.md
+++ /dev/null
@@ -1,445 +0,0 @@
---
mathjax: include
title: Looking under the hood of pipelines
# Sub navigation
sub-nav-group: batch
sub-nav-parent: flinkml
sub-nav-title: Pipelines
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

* This will be replaced by the TOC
{:toc}

## Introduction

The ability to chain together different transformers and predictors is an important feature for
any Machine Learning (ML) library. In FlinkML we wanted to provide an intuitive API,
and at the same time utilize the capabilities of the Scala language to provide
type-safe implementations of our pipelines. What we hope to achieve then is an easy-to-use API,
that protects users from type errors at pre-flight (before the job is launched) time, thereby
eliminating cases where long-running jobs are submitted to the cluster only to see them fail due
to some error in the series of data transformations that commonly happen in an ML pipeline.

In this guide then we will describe the choices we made during the implementation of chainable
transformers and predictors in FlinkML, and provide guidelines on how developers can create their
own algorithms that make use of these capabilities.

## The what and the why

So what do we mean by "ML pipelines"? Pipelines in the ML context can be thought of as chains of
operations that have some data as input, perform a number of transformations to that data,
and
then output the transformed data, either to be used as the input (features) of a predictor
function, such as a learning model, or just output the transformed data themselves, to be used in
some other task. The end learner can of course be a part of the pipeline as well.
ML pipelines can often be complicated sets of operations ([in-depth explanation](http://research.google.com/pubs/pub43146.html)) and
can become sources of errors for end-to-end learning systems.

The purpose of ML pipelines is then to create a
framework that can be used to manage the complexity introduced by these chains of operations.
Pipelines should make it easy for developers to define chained transformations that can be
applied to the
training data, in order to create the end features that will be used to train a
learning model, and then perform the same set of transformations just as easily to unlabeled
(test) data. Pipelines should also simplify cross-validation and model selection on
these chains of operations.

Finally, by ensuring that the consecutive links in the pipeline chain "fit together" we also
avoid costly type errors. Since each step in a pipeline can be a computationally-heavy operation,
we want to avoid running a pipelined job, unless we are sure that all the input/output pairs in a
pipeline "fit".

## Pipelines in FlinkML

The building blocks for pipelines in FlinkML can be found in the `ml.pipeline` package.
FlinkML follows an API inspired by [sklearn](http://scikit-learn.org) which means that we have
`Estimator`, `Transformer` and `Predictor` interfaces. For an in-depth look at the design of the
sklearn API the interested reader is referred to [this](http://arxiv.org/abs/1309.0238) paper.
In short, the `Estimator` is the base class from which `Transformer` and `Predictor` inherit.
`Estimator` defines a `fit` method; `Transformer` additionally defines a `transform` method, and
`Predictor` defines a `predict` method.

The `fit` method of the `Estimator` performs the actual training of the model, for example
finding the correct weights in a linear regression task, or the mean and standard deviation of
the data in a feature scaler.
As evident by the naming, classes that implement
`Transformer` are transform operations like [scaling the input](standard_scaler.html) and
`Predictor` implementations are learning algorithms such as [Multiple Linear Regression]({{site.baseurl}}/libs/ml/multiple_linear_regression.html).
Pipelines can be created by chaining together a number of Transformers, and the final link in a pipeline can be a Predictor or another Transformer.
Pipelines that end with a Predictor cannot be chained any further.
Below is an example of how a pipeline can be formed:

{% highlight scala %}
// Training data
val input: DataSet[LabeledVector] = ...
// Test data
val unlabeled: DataSet[Vector] = ...

val scaler = StandardScaler()
val polyFeatures = PolynomialFeatures()
val mlr = MultipleLinearRegression()

// Construct the pipeline
val pipeline = scaler
 .chainTransformer(polyFeatures)
 .chainPredictor(mlr)

// Train the pipeline (scaler and multiple linear regression)
pipeline.fit(input)

// Calculate predictions for the testing data
val predictions: DataSet[LabeledVector] = pipeline.predict(unlabeled)

{% endhighlight %}

As we mentioned, FlinkML pipelines are type-safe.
If we tried to chain a transformer with output of type `A` to another with input of type `B` we
would get an error at pre-flight time if `A` != `B`. FlinkML achieves this kind of type safety
through the use of Scala's implicits.

### Scala implicits

If you are not familiar with Scala's implicits we can recommend [this excerpt](https://www.artima.com/pins1ed/implicit-conversions-and-parameters.html)
from Martin Odersky's "Programming in Scala". In short, implicit conversions allow for ad-hoc
polymorphism in Scala by providing conversions from one type to another, and implicit values
provide the compiler with default values that can be supplied to function calls through implicit parameters.
The combination of implicit conversions and implicit parameters is what allows us to chain transform
and predict operations together in a typesafe manner.
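
The following self-contained snippet (generic Scala, unrelated to FlinkML's classes) shows the mechanism: the compiler fills in the implicit parameter from the implicit scope, and fails at compile time when no matching value exists.

{% highlight scala %}
trait Show[T] { def show(t: T): String }

// an implicit value available for Int, but for no other type
implicit val showInt: Show[Int] = new Show[Int] {
  def show(t: Int): String = s"Int($t)"
}

def describe[T](t: T)(implicit s: Show[T]): String = s.show(t)

describe(42)       // compiles: showInt is resolved implicitly
// describe("foo") // compile error: no implicit Show[String] in scope
{% endhighlight %}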

### Operations

As we mentioned, the trait (abstract class) `Estimator` defines a `fit` method. The method has two
parameter lists
(i.e. is a [curried function](http://docs.scala-lang.org/tutorials/tour/currying.html)). The
first parameter list
takes the input (training) `DataSet` and the parameters for the estimator. The second parameter
list takes one `implicit` parameter, of type `FitOperation`. `FitOperation` is a class that also
defines a `fit` method, and this is where the actual logic of training the concrete Estimators
should be implemented. The `fit` method of `Estimator` is essentially a wrapper around the fit
method of `FitOperation`. The `predict` method of `Predictor` and the `transform` method of
`Transformer` are designed in a similar manner, each with a respective operation class.

In these methods the operation object is provided as an implicit parameter.
Scala will [look for implicits](http://docs.scala-lang.org/tutorials/FAQ/finding-implicits.html)
in the companion object of a type, so classes that implement these interfaces should provide these
objects as implicit objects inside the companion object.

As an example we can look at the `StandardScaler` class. `StandardScaler` extends `Transformer`, so it has access to its `fit` and `transform` functions.
These two functions expect objects of `FitOperation` and `TransformOperation` as implicit parameters,
for the `fit` and `transform` methods respectively, which `StandardScaler` provides in its companion
object, through `transformVectors` and `fitVectorStandardScaler`:

{% highlight scala %}
class StandardScaler extends Transformer[StandardScaler] {
 ...
}

object StandardScaler {

 ...

  implicit def fitVectorStandardScaler[T <: Vector] = new FitOperation[StandardScaler, T] {
    override def fit(instance: StandardScaler, fitParameters: ParameterMap, input: DataSet[T])
      : Unit = {
      ...
    }
  }

  implicit def transformVectors[T <: Vector: VectorConverter: TypeInformation: ClassTag] = {
    new TransformOperation[StandardScaler, T, T] {
      override def transform(
          instance: StandardScaler,
          transformParameters: ParameterMap,
          input: DataSet[T])
        : DataSet[T] = {
        ...
      }
    }
  }
}

{% endhighlight %}

Note that `StandardScaler` does **not** override the `fit` method of `Estimator` or the `transform`
method of `Transformer`. Rather, its implementations of `FitOperation` and `TransformOperation`
override their respective `fit` and `transform` methods, which are then called by the `fit` and
`transform` methods of `Estimator` and `Transformer`. Similarly, a class that implements
`Predictor` should define an implicit `PredictOperation` object inside its companion object.

#### Types and type safety

Apart from the `fit` and `transform` operations that we listed above, the `StandardScaler` also
provides `fit` and `transform` operations for input of type `LabeledVector`.
This allows us to use the algorithm for input that is labeled or unlabeled, and this happens
automatically, depending on the type of the input that we give to the fit and transform
operations. The correct implicit operation is chosen by the compiler, depending on the input type.

If we try to call the `fit` or `transform` methods with types that are not supported we will get a
runtime error before the job is launched.
While it would be possible to catch these kinds of errors at compile time as well, the error
messages that we are able to provide the user would be much less informative, which is why we chose
to throw runtime exceptions instead.

### Chaining

Chaining is achieved by calling `chainTransformer` or `chainPredictor` on an object
of a class that implements `Transformer`. These methods return a `ChainedTransformer` or
`ChainedPredictor` object respectively. As we mentioned, `ChainedTransformer` objects can be
chained further, while `ChainedPredictor` objects cannot. These classes take care of applying
fit, transform, and predict operations for a pair of successive transformers or
a transformer and a predictor. They also act recursively if the length of the
chain is larger than two, since every `ChainedTransformer` defines a `transform` and `fit`
operation that can be further chained with more transformers or a predictor.
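
Because every `ChainedTransformer` is itself a `Transformer`, chains longer than two need no special handling. A sketch, reusing the hypothetical `scaler`, `polyFeatures` and `mlr` values from the earlier examples plus an additional hypothetical `anotherScaler`:

{% highlight scala %}
val pipeline = scaler
  .chainTransformer(polyFeatures)   // ChainedTransformer(scaler, polyFeatures)
  .chainTransformer(anotherScaler)  // nests recursively into another ChainedTransformer
  .chainPredictor(mlr)              // ChainedPredictor: cannot be chained any further
{% endhighlight %}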

It is important to note that developers and users do not need to worry about chaining when
implementing their algorithms; all this is handled automatically by FlinkML.

### How to Implement a Pipeline Operator

In order to support FlinkML's pipelining, algorithms have to adhere to a certain design pattern, which we will describe in this section.
Let's assume that we want to implement a pipeline operator which changes the mean of your data.
Since centering data is a common preprocessing step in many analysis pipelines, we will implement it as a `Transformer`.
Therefore, we first create a `MeanTransformer` class which inherits from `Transformer`

{% highlight scala %}
class MeanTransformer extends Transformer[MeanTransformer] {}
{% endhighlight %}

Since we want to be able to configure the mean of the resulting data, we have to add a configuration parameter.

{% highlight scala %}
class MeanTransformer extends Transformer[MeanTransformer] {
 def setMean(mean: Double): this.type = {
 parameters.add(MeanTransformer.Mean, mean)
 this
 }
}

object MeanTransformer {
 case object Mean extends Parameter[Double] {
 override val defaultValue: Option[Double] = Some(0.0)
 }

 def apply(): MeanTransformer = new MeanTransformer
}
{% endhighlight %}

Parameters are defined in the companion object of the transformer class and extend the `Parameter` class.
Since the parameter instances are supposed to act as immutable keys for a parameter map, they should be implemented as `case objects`.
The default value will be used if no other value has been set by the user of this component.
If no default value has been specified, meaning that `defaultValue = None`, then the algorithm has to handle this situation accordingly.

We can now instantiate a `MeanTransformer` object and set the mean value of the transformed data.
But we still have to implement how the transformation works.
The workflow can be separated into two phases.
Within the first phase, the transformer learns the mean of the given training data.
This knowledge can then be used in the second phase to transform the provided data with respect to the configured resulting mean value.

The learning of the mean can be implemented within the `fit` operation of our `Transformer`, which it inherited from `Estimator`.
Within the `fit` operation, a pipeline component is trained with respect to the given training data.
The algorithm is, however, **not** implemented by overriding the `fit` method but by providing an implementation of a corresponding `FitOperation` for the correct type.
Taking a look at the definition of the `fit` method in `Estimator`, which is the parent class of `Transformer`, reveals why this is the case.

{% highlight scala %}
trait Estimator[Self] extends WithParameters with Serializable {
 that: Self =>

 def fit[Training](
 training: DataSet[Training],
 fitParameters: ParameterMap = ParameterMap.Empty)
 (implicit fitOperation: FitOperation[Self, Training]): Unit = {
 FlinkMLTools.registerFlinkMLTypes(training.getExecutionEnvironment)
 fitOperation.fit(this, fitParameters, training)
 }
}
{% endhighlight %}

We see that the `fit` method is called with an input data set of type `Training`, an optional parameter list, and, in the second parameter list, an implicit parameter of type `FitOperation`.
Within the body of the function, first some machine learning types are registered and then the `fit` method of the `FitOperation` parameter is called.
The instance passes itself, the parameter map and the training data set as parameters to the method.
Thus, all the program logic takes place within the `FitOperation`.

The `FitOperation` has two type parameters.
The first defines the pipeline operator type for which this `FitOperation` shall work and the second type parameter defines the type of the data set elements.
If we first wanted to implement the `MeanTransformer` to work on `DenseVector`, we would, thus, have to provide an implementation for `FitOperation[MeanTransformer, DenseVector]`.

{% highlight scala %}
val denseVectorMeanFitOperation = new FitOperation[MeanTransformer, DenseVector] {
 override def fit(instance: MeanTransformer, fitParameters: ParameterMap, input: DataSet[DenseVector]) : Unit = {
 import org.apache.flink.ml.math.Breeze._
 val meanTrainingData: DataSet[DenseVector] = input
 .map{ x => (x.asBreeze, 1) }
 .reduce{
 (left, right) =>
 (left._1 + right._1, left._2 + right._2)
 }
 .map{ p => (p._1/p._2).fromBreeze }
 }
}
{% endhighlight %}

A `FitOperation[T, I]` has a `fit` method which is called with an instance of type `T`, a parameter map and an input `DataSet[I]`.
In our case `T=MeanTransformer` and `I=DenseVector`.
The parameter map is necessary if our fit step depends on some parameter values which were not given directly at creation time of the `Transformer`.
The `FitOperation` of the `MeanTransformer` sums up the `DenseVector` instances of the given input data set and divides the result by the total number of vectors.
That way, we obtain a `DataSet[DenseVector]` with a single element which is the mean value.

But if we look closely at the implementation, we see that the result of the mean computation is never stored anywhere.
If we want to use this knowledge in a later step to adjust the mean of some other input, we have to keep it around.
And here is where the parameter of type `MeanTransformer` which is given to the `fit` method comes into play.
We can use this instance to store state, which is used by a subsequent `transform` operation which works on the same object.
But first we have to extend `MeanTransformer` by a member field and then adjust the `FitOperation` implementation.

{% highlight scala %}
class MeanTransformer extends Transformer[MeanTransformer] {
  var meanOption: Option[DataSet[DenseVector]] = None

  def setMean(mean: Double): this.type = {
    parameters.add(MeanTransformer.Mean, mean)
    this
  }
}

val denseVectorMeanFitOperation = new FitOperation[MeanTransformer, DenseVector] {
 override def fit(instance: MeanTransformer, fitParameters: ParameterMap, input: DataSet[DenseVector]) : Unit = {
 import org.apache.flink.ml.math.Breeze._

 instance.meanOption = Some(input
 .map{ x => (x.asBreeze, 1) }
 .reduce{
 (left, right) =>
 (left._1 + right._1, left._2 + right._2)
 }
 .map{ p => (p._1/p._2).fromBreeze })
 }
}
{% endhighlight %}

If we look at the `transform` method in `Transformer`, we will see that we also need an implementation of `TransformOperation`.
A possible mean transforming implementation could look like the following.

{% highlight scala %}

val denseVectorMeanTransformOperation = new TransformOperation[MeanTransformer, DenseVector, DenseVector] {
 override def transform(
 instance: MeanTransformer,
 transformParameters: ParameterMap,
 input: DataSet[DenseVector])
 : DataSet[DenseVector] = {
 val resultingParameters = instance.parameters ++ transformParameters

 val resultingMean = resultingParameters(MeanTransformer.Mean)

 instance.meanOption match {
 case Some(trainingMean) => {
 input.map{ new MeanTransformMapper(resultingMean) }.withBroadcastSet(trainingMean, "trainingMean")
 }
 case None => throw new RuntimeException("MeanTransformer has not been fitted to data.")
 }
 }
}

class MeanTransformMapper(resultingMean: Double) extends RichMapFunction[DenseVector, DenseVector] {
 var trainingMean: DenseVector = null

 override def open(parameters: Configuration): Unit = {
 trainingMean = getRuntimeContext().getBroadcastVariable[DenseVector]("trainingMean").get(0)
 }

 override def map(vector: DenseVector): DenseVector = {
 import org.apache.flink.ml.math.Breeze._

 val result = vector.asBreeze - trainingMean.asBreeze + resultingMean

 result.fromBreeze
 }
}
{% endhighlight %}

Now we have everything implemented to fit our `MeanTransformer` to a training data set of `DenseVector` instances and to transform them.
However, when we execute the `fit` operation

{% highlight scala %}
val trainingData: DataSet[DenseVector] = ...
val meanTransformer = MeanTransformer()

meanTransformer.fit(trainingData)
{% endhighlight %}

we receive the following error at runtime: `"There is no FitOperation defined for class MeanTransformer which trains on a DataSet[org.apache.flink.ml.math.DenseVector]"`.
The reason is that the Scala compiler could not find a fitting `FitOperation` value with the right type parameters for the implicit parameter of the `fit` method.
Therefore, it chose a fallback implicit value which gives you this error message at runtime.
In order to make the compiler aware of our implementation, we have to define it as an implicit value and put it in the scope of the `MeanTransformer`'s companion object.

{% highlight scala %}
object MeanTransformer{
 implicit val denseVectorMeanFitOperation = new FitOperation[MeanTransformer, DenseVector] ...

 implicit val denseVectorMeanTransformOperation = new TransformOperation[MeanTransformer, DenseVector, DenseVector] ...
}
{% endhighlight %}

Now we can call `fit` and `transform` of our `MeanTransformer` with `DataSet[DenseVector]` as input.
Furthermore, we can now use this transformer as part of an analysis pipeline where we have a `DenseVector` as input and expected output.

{% highlight scala %}
val trainingData: DataSet[DenseVector] = ...

val mean = MeanTransformer().setMean(1.0)
val polyFeatures = PolynomialFeatures().setDegree(3)

val pipeline = mean.chainTransformer(polyFeatures)

pipeline.fit(trainingData)
{% endhighlight %}

It is noteworthy that there is no additional code needed to enable chaining.
The system automatically constructs the pipeline logic using the operations of the individual components.
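
The result of `chainTransformer` is itself a `Transformer`, so it can be chained further, for example with a `Predictor` such as the `MultipleLinearRegression` used elsewhere in these docs. A small sketch, reusing the components from above:

{% highlight scala %}
// The chained transformer behaves like a single pipeline component ...
val centeringPipeline = mean.chainTransformer(polyFeatures)

// ... and can therefore be chained with a predictor into a full pipeline
val fullPipeline = centeringPipeline.chainPredictor(MultipleLinearRegression())

fullPipeline.fit(trainingData)
{% endhighlight %}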

So far everything works fine with `DenseVector`.
But what happens if we call our transformer with `LabeledVector` instead?
{% highlight scala %}
val trainingData: DataSet[LabeledVector] = ...

val mean = MeanTransformer()

mean.fit(trainingData)
{% endhighlight %}

As before, we see the following exception upon execution of the program: `"There is no FitOperation defined for class MeanTransformer which trains on a DataSet[org.apache.flink.ml.common.LabeledVector]"`.
It is noteworthy that this exception is thrown in the pre-flight phase, which means that the job has not yet been submitted to the runtime system.
This has the advantage that you won't see a job which runs for a couple of days and then fails because of an incompatible pipeline component.
Type compatibility is, thus, checked at the very beginning for the complete job.

In order to make the `MeanTransformer` work on `LabeledVector` as well, we have to provide the corresponding operations.
Consequently, we have to define a `FitOperation[MeanTransformer, LabeledVector]` and `TransformOperation[MeanTransformer, LabeledVector, LabeledVector]` as implicit values in the scope of `MeanTransformer`'s companion object.

{% highlight scala %}
object MeanTransformer {
 implicit val labeledVectorFitOperation = new FitOperation[MeanTransformer, LabeledVector] ...

 implicit val labeledVectorTransformOperation = new TransformOperation[MeanTransformer, LabeledVector, LabeledVector] ...
}
{% endhighlight %}

If we wanted to implement a `Predictor` instead of a `Transformer`, then we would have to provide a `FitOperation`, too.
Moreover, a `Predictor` requires a `PredictOperation` which implements how predictions are calculated from testing data.
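
The exact shape of such a `PredictOperation` depends on the predictor at hand. As a rough, hypothetical sketch (the `MeanPredictor` below is made up for illustration), an implementation following the `PredictOperation[Instance, Model, Testing, Prediction]` convention with its `getModel` and `predict` methods might look like this:

{% highlight scala %}
val denseVectorMeanPredictOperation =
  new PredictOperation[MeanPredictor, DenseVector, DenseVector, Double] {
    override def getModel(
        instance: MeanPredictor,
        predictParameters: ParameterMap)
      : DataSet[DenseVector] = {
      instance.meanOption match {
        case Some(mean) => mean
        case None => throw new RuntimeException("MeanPredictor has not been fitted to data.")
      }
    }

    // Hypothetical prediction: the Euclidean distance of the input from the fitted mean
    override def predict(value: DenseVector, model: DenseVector): Double = {
      import org.apache.flink.ml.math.Breeze._

      val diff = value.asBreeze - model.asBreeze
      math.sqrt(diff dot diff)
    }
  }
{% endhighlight %}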


http://git-wip-us.apache.org/repos/asf/flink/blob/844c874b/docs/apis/batch/libs/ml/polynomial_features.md

diff --git a/docs/apis/batch/libs/ml/polynomial_features.md b/docs/apis/batch/libs/ml/polynomial_features.md
deleted file mode 100644
index 9ef7654..0000000
--- a/docs/apis/batch/libs/ml/polynomial_features.md
+++ /dev/null
@@ -1,111 +0,0 @@

mathjax: include
title: Polynomial Features
# Sub navigation
sub-nav-group: batch
sub-nav-parent: flinkml
sub-nav-title: Polynomial Features

<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

* This will be replaced by the TOC
{:toc}

## Description

The polynomial features transformer maps a vector into the polynomial feature space of degree $d$.
The dimension of the input vector determines the number of polynomial factors whose values are the respective vector entries.
Given a vector $(x, y, z, \ldots)^T$ the resulting feature vector looks like:

$$\left(x, y, z, x^2, xy, xz, y^2, yz, z^2, x^3, x^2y, x^2z, xy^2, xyz, xz^2, y^3, \ldots\right)^T$$

Flink's implementation orders the polynomials in decreasing order of their degree.

Given the vector $\left(3,2\right)^T$, the polynomial features vector of degree 3 would look like

 $$\left(3^3, 3^2\cdot2, 3\cdot2^2, 2^3, 3^2, 3\cdot2, 2^2, 3, 2\right)^T$$

This transformer can be prepended to all `Transformer` and `Predictor` implementations which expect an input of type `LabeledVector` or any subtype of `Vector`.

## Operations

`PolynomialFeatures` is a `Transformer`.
As such, it supports the `fit` and `transform` operations.

### Fit

PolynomialFeatures is not trained on data and, thus, supports all types of input data.

### Transform

PolynomialFeatures transforms all subtypes of `Vector` and `LabeledVector` into their respective types:

* `transform[T <: Vector]: DataSet[T] => DataSet[T]`
* `transform: DataSet[LabeledVector] => DataSet[LabeledVector]`
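
For instance, picking up the degree-3 example from the description, a direct use of `transform` could look like the following sketch (the input values are chosen for illustration):

{% highlight scala %}
import org.apache.flink.api.scala._
import org.apache.flink.ml.math.DenseVector
import org.apache.flink.ml.preprocessing.PolynomialFeatures

val env = ExecutionEnvironment.getExecutionEnvironment

val input: DataSet[DenseVector] = env.fromElements(DenseVector(3.0, 2.0))

val polyFeatures = PolynomialFeatures().setDegree(3)

// fit is a no-op for PolynomialFeatures, but we keep the usual Transformer protocol
polyFeatures.fit(input)

// Each element is expanded to (27.0, 18.0, 12.0, 8.0, 9.0, 6.0, 4.0, 3.0, 2.0)
val expanded: DataSet[DenseVector] = polyFeatures.transform(input)
{% endhighlight %}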

## Parameters

The polynomial features transformer can be controlled by the following parameters:

<table class="table tablebordered">
 <thead>
 <tr>
 <th class="textleft" style="width: 20%">Parameters</th>
 <th class="textcenter">Description</th>
 </tr>
 </thead>

 <tbody>
 <tr>
 <td><strong>Degree</strong></td>
 <td>
 <p>
 The maximum polynomial degree.
 (Default value: <strong>10</strong>)
 </p>
 </td>
 </tr>
 </tbody>
 </table>

## Examples

{% highlight scala %}
// Obtain the training data set
val trainingDS: DataSet[LabeledVector] = ...

// Setup polynomial feature transformer of degree 3
val polyFeatures = PolynomialFeatures()
  .setDegree(3)

// Setup the multiple linear regression learner
val mlr = MultipleLinearRegression()

// Control the learner via the parameter map
val parameters = ParameterMap()
  .add(MultipleLinearRegression.Iterations, 20)
  .add(MultipleLinearRegression.Stepsize, 0.5)

// Create pipeline PolynomialFeatures -> MultipleLinearRegression
val pipeline = polyFeatures.chainPredictor(mlr)

// Train the model, passing the parameter map to the fit operation
pipeline.fit(trainingDS, parameters)
{% endhighlight %}
http://git-wip-us.apache.org/repos/asf/flink/blob/844c874b/docs/apis/batch/libs/ml/quickstart.md

diff --git a/docs/apis/batch/libs/ml/quickstart.md b/docs/apis/batch/libs/ml/quickstart.md
deleted file mode 100644
index 60f505e..0000000
--- a/docs/apis/batch/libs/ml/quickstart.md
+++ /dev/null
@@ -1,244 +0,0 @@

mathjax: include
title: Quickstart Guide
# Sub navigation
sub-nav-group: batch
sub-nav-parent: flinkml
sub-nav-title: Quickstart Guide

<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

* This will be replaced by the TOC
{:toc}

## Introduction

FlinkML is designed to make learning from your data a straightforward process, abstracting away
the complexities that usually come with big data learning tasks. In this
quickstart guide we will show just how easy it is to solve a simple supervised learning problem
using FlinkML. But first, some basics; feel free to skip the next few lines if you're already
familiar with Machine Learning (ML).

As defined by Murphy [[1]](#murphy), ML deals with detecting patterns in data and using those
learned patterns to make predictions about the future. We can categorize most ML algorithms into
two major categories: Supervised and Unsupervised Learning.

* **Supervised Learning** deals with learning a function (mapping) from a set of inputs
(features) to a set of outputs. The learning is done using a *training set* of (input,
output) pairs that we use to approximate the mapping function. Supervised learning problems are
further divided into classification and regression problems. In classification problems we try to
predict the *class* that an example belongs to, for example whether a user is going to click on
an ad or not. Regression problems, on the other hand, are about predicting (real) numerical
values, often called the dependent variable, for example what the temperature will be tomorrow.

* **Unsupervised Learning** deals with discovering patterns and regularities in the data. An example
of this would be *clustering*, where we try to discover groupings of the data from the
descriptive features. Unsupervised learning can also be used for feature selection, for example
through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).

## Linking with FlinkML

In order to use FlinkML in your project, first you have to
[set up a Flink program](http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#linking-with-flink).
Next, you have to add the FlinkML dependency to the `pom.xml` of your project:

{% highlight xml %}
<dependency>
 <groupId>org.apache.flink</groupId>
 <artifactId>flink-ml{{ site.scala_version_suffix }}</artifactId>
 <version>{{site.version }}</version>
</dependency>
{% endhighlight %}

## Loading data

To load data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
functions for formatted data, such as the LibSVM format. For supervised learning problems it is
common to use the `LabeledVector` class to represent the `(label, features)` examples. A `LabeledVector`
object has a FlinkML `Vector` member representing the features of the example and a `Double`
member representing the label, which could be the class in a classification problem or the dependent
variable in a regression problem.
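
As a minimal illustration of this container (the values below are made up):

{% highlight scala %}
import org.apache.flink.ml.common.LabeledVector
import org.apache.flink.ml.math.DenseVector

// A single example with label 1.0 and three numerical features
val example = LabeledVector(1.0, DenseVector(30.0, 64.0, 1.0))
{% endhighlight %}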

As an example, we can use Haberman's Survival Data Set, which you can
[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data).
This dataset *"contains cases from a study conducted on the survival of patients who had undergone
surgery for breast cancer"*. The data comes in a comma-separated file, where the first 3 columns
are the features and the 4th column is the class label: it indicates whether the patient
survived 5 years or longer (label 1), or died within 5 years (label 2). You can check the [UCI
page](https://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival) for more information on the data.

We can load the data as a `DataSet[String]` first:

{% highlight scala %}

import org.apache.flink.api.scala.ExecutionEnvironment

val env = ExecutionEnvironment.getExecutionEnvironment

val survival = env.readCsvFile[(String, String, String, String)]("/path/to/haberman.data")

{% endhighlight %}

We can now transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
dataset with the FlinkML classification algorithms. We know that the 4th element of the dataset
is the class label, and the rest are features, so we can build `LabeledVector` elements like this:

{% highlight scala %}

import org.apache.flink.ml.common.LabeledVector
import org.apache.flink.ml.math.DenseVector

val survivalLV = survival
  .map { tuple =>
    val list = tuple.productIterator.toList
    val numList = list.map(_.asInstanceOf[String].toDouble)
    LabeledVector(numList(3), DenseVector(numList.take(3).toArray))
  }

{% endhighlight %}

We could then use this data to train a learner. However, we will use another dataset to exemplify
building a learner, which allows us to show how to import other dataset formats.

**LibSVM files**

A common format for ML datasets is the LibSVM format, and a number of datasets using that format can be
found [on the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
datasets in the LibSVM format through the `readLibSVM` function available in the `MLUtils`
object.
You can also save datasets in the LibSVM format using the `writeLibSVM` function.
Let's import the svmguide1 dataset. You can download the
[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/svmguide1)
and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/svmguide1.t).
This is an astroparticle binary classification dataset, used by Hsu et al. [[3]](#hsu) in their
practical Support Vector Machine (SVM) guide. It contains 4 numerical features, and the class label.

We can then simply import the dataset using:

{% highlight scala %}

import org.apache.flink.ml.MLUtils

val astroTrain: DataSet[LabeledVector] = MLUtils.readLibSVM("/path/to/svmguide1")
val astroTest: DataSet[LabeledVector] = MLUtils.readLibSVM("/path/to/svmguide1.t")

{% endhighlight %}
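
Conversely, a labeled data set can be written back out in LibSVM format; a small sketch, assuming `writeLibSVM(path, data)` as the counterpart of `readLibSVM` and an illustrative output path:

{% highlight scala %}
// Write the training set back out in LibSVM format
MLUtils.writeLibSVM("/path/to/output/svmguide1.libsvm", astroTrain)
{% endhighlight %}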

This gives us two `DataSet[LabeledVector]` objects that we will use in the following section to
create a classifier.

## Classification

Once we have imported the dataset we can train a `Predictor` such as a linear SVM classifier.
We can set a number of parameters for the classifier. Here we set the `Blocks` parameter,
which determines the number of blocks into which the input is split for the underlying CoCoA algorithm [[2]](#jaggi). The
regularization parameter determines the amount of $l_2$ regularization applied, which is used
to avoid overfitting. The step size determines the contribution of the weight vector updates to
the next weight vector value; this parameter sets the initial step size.

{% highlight scala %}

import org.apache.flink.ml.classification.SVM

val svm = SVM()
  .setBlocks(env.getParallelism)
  .setIterations(100)
  .setRegularization(0.001)
  .setStepsize(0.1)
  .setSeed(42)

svm.fit(astroTrain)

{% endhighlight %}

We can now make predictions on the test set.

{% highlight scala %}

val predictionPairs = svm.predict(astroTest)

{% endhighlight %}

Next we will see how we can preprocess our data, and use the ML pipelines capabilities of FlinkML.

## Data preprocessing and pipelines

A preprocessing step that is often encouraged [[3]](#hsu) when using SVM classification is scaling
the input features to the [0, 1] range, in order to avoid features with extreme values
dominating the rest.
FlinkML has a number of `Transformers` such as `MinMaxScaler` that are used to preprocess data,
and a key feature is the ability to chain `Transformers` and `Predictors` together. This allows
us to run the same pipeline of transformations and make predictions on the train and test data in
a straightforward and type-safe manner. You can read more on the pipeline system of FlinkML
[in the pipelines documentation](pipelines.html).

Let us first create a normalizing transformer for the features in our dataset, and chain it to a
new SVM classifier.

{% highlight scala %}

import org.apache.flink.ml.preprocessing.MinMaxScaler

val scaler = MinMaxScaler()

val scaledSVM = scaler.chainPredictor(svm)

{% endhighlight %}

We can now use our newly created pipeline to make predictions on the test set.
First we call `fit` again to train the scaler and the SVM classifier.
The data of the test set will then be automatically scaled before being passed on to the SVM to
make predictions.

{% highlight scala %}

scaledSVM.fit(astroTrain)

val predictionPairsScaled: DataSet[(Double, Double)] = scaledSVM.predict(astroTest)

{% endhighlight %}

The scaled inputs should give us better prediction performance.
The result of the prediction on `LabeledVector`s is a data set of tuples where the first entry denotes the true label value and the second entry is the predicted label value.
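Given this structure, a simple error measure such as the misclassification rate can be sketched as follows (note that, depending on the label encoding of the dataset and the sign convention of the classifier, the labels may need to be remapped before comparing):

{% highlight scala %}
// Fraction of test examples whose predicted label differs from the true label
val errorRate: DataSet[Double] = predictionPairsScaled
  .map { pair => (if (pair._1 == pair._2) 0.0 else 1.0, 1) }
  .reduce { (left, right) => (left._1 + right._1, left._2 + right._2) }
  .map { sumCount => sumCount._1 / sumCount._2 }
{% endhighlight %}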

## Where to go from here

This quickstart guide can act as an introduction to the basic concepts of FlinkML, but there's a lot
more you can do.
We recommend going through the [FlinkML documentation](index.html), and trying out the different
algorithms.
A very good way to get started is to play around with interesting datasets from the UCI ML
repository and the LibSVM datasets.
Tackling an interesting problem from a website like [Kaggle](https://www.kaggle.com) or
[DrivenData](http://www.drivendata.org/) is also a great way to learn by competing with other
data scientists.
If you would like to contribute some new algorithms take a look at our
[contribution guide](contribution_guide.html).

**References**

<a name="murphy"></a>[1] Murphy, Kevin P. *Machine learning: a probabilistic perspective.* MIT
press, 2012.

<a name="jaggi"></a>[2] Jaggi, Martin, et al. *Communicationefficient distributed dual
coordinate ascent.* Advances in Neural Information Processing Systems. 2014.

<a name="hsu"></a>[3] Hsu, ChihWei, ChihChung Chang, and ChihJen Lin.
 *A practical guide to support vector classification.* 2003.
http://git-wip-us.apache.org/repos/asf/flink/blob/844c874b/docs/apis/batch/libs/ml/standard_scaler.md

diff --git a/docs/apis/batch/libs/ml/standard_scaler.md b/docs/apis/batch/libs/ml/standard_scaler.md
deleted file mode 100644
index 3a9cd4b..0000000
--- a/docs/apis/batch/libs/ml/standard_scaler.md
+++ /dev/null
@@ -1,116 +0,0 @@

mathjax: include
title: Standard Scaler
# Sub navigation
sub-nav-group: batch
sub-nav-parent: flinkml
sub-nav-title: Standard Scaler

<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

* This will be replaced by the TOC
{:toc}

## Description

The standard scaler scales the given data set so that all features will have a user-specified mean and standard deviation.
In case the user does not provide a specific mean and standard deviation, the standard scaler transforms the features of the input data set to have mean equal to 0 and standard deviation equal to 1.
Given a set of input data $x_1, x_2, \ldots, x_n$, with mean:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_{i}$$

and standard deviation:

$$\sigma_{x} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(x_{i} - \bar{x}\right)^{2}}$$

the scaled data set $z_1, z_2, \ldots, z_n$ will be:

$$z_{i} = \textit{std} \cdot \frac{x_{i} - \bar{x}}{\sigma_{x}} + \textit{mean}$$

where $\textit{std}$ and $\textit{mean}$ are the user-specified values for the standard deviation and mean.
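
For instance, fitting the scaler to the one-dimensional data set $1, 2, 3$ yields $\bar{x} = 2$ and $\sigma_{x} = \sqrt{2/3}$; with the user-specified values $\textit{std} = 2$ and $\textit{mean} = 10$, the point $x_3 = 3$ is mapped to

$$z_{3} = 2 \cdot \frac{3 - 2}{\sqrt{2/3}} + 10 \approx 12.45$$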

## Operations

`StandardScaler` is a `Transformer`.
As such, it supports the `fit` and `transform` operations.

### Fit

StandardScaler is trained on all subtypes of `Vector` or `LabeledVector`:

* `fit[T <: Vector]: DataSet[T] => Unit`
* `fit: DataSet[LabeledVector] => Unit`

### Transform

StandardScaler transforms all subtypes of `Vector` or `LabeledVector` into the respective type:

* `transform[T <: Vector]: DataSet[T] => DataSet[T]`
* `transform: DataSet[LabeledVector] => DataSet[LabeledVector]`

## Parameters

The standard scaler implementation can be controlled by the following two parameters:

 <table class="table tablebordered">
 <thead>
 <tr>
 <th class="textleft" style="width: 20%">Parameters</th>
 <th class="textcenter">Description</th>
 </tr>
 </thead>

 <tbody>
 <tr>
 <td><strong>Mean</strong></td>
 <td>
 <p>
 The mean of the scaled data set. (Default value: <strong>0.0</strong>)
 </p>
 </td>
 </tr>
 <tr>
 <td><strong>Std</strong></td>
 <td>
 <p>
 The standard deviation of the scaled data set. (Default value: <strong>1.0</strong>)
 </p>
 </td>
 </tr>
 </tbody>
</table>

## Examples

{% highlight scala %}
// Create standard scaler transformer
val scaler = StandardScaler()
  .setMean(10.0)
  .setStd(2.0)

// Obtain data set to be scaled
val dataSet: DataSet[Vector] = ...

// Learn the mean and standard deviation of the training data
scaler.fit(dataSet)

// Scale the provided data set to have mean=10.0 and std=2.0
val scaledDS = scaler.transform(dataSet)
{% endhighlight %}
