http://git-wip-us.apache.org/repos/asf/flink/blob/844c874b/docs/dev/libs/ml/index.md

diff --git a/docs/dev/libs/ml/index.md b/docs/dev/libs/ml/index.md
new file mode 100644
index 0000000..d01e18e
--- /dev/null
+++ b/docs/dev/libs/ml/index.md
@@ -0,0 +1,144 @@
+---
+title: "FlinkML - Machine Learning for Flink"
+nav-id: ml
+nav-show_overview: true
+nav-title: Machine Learning
+nav-parent_id: libs
+nav-pos: 4
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+FlinkML is the Machine Learning (ML) library for Flink. It is a new effort in the Flink community,
+with a growing list of algorithms and contributors. With FlinkML we aim to provide
+scalable ML algorithms, an intuitive API, and tools that help minimize glue code in end-to-end ML
+systems. You can see more details about our goals and where the library is headed in our [vision
+and roadmap here](https://cwiki.apache.org/confluence/display/FLINK/FlinkML%3A+Vision+and+Roadmap).
+
+* This will be replaced by the TOC
+{:toc}
+
+## Supported Algorithms
+
+FlinkML currently supports the following algorithms:
+
+### Supervised Learning
+
+* [SVM using communication-efficient distributed dual coordinate ascent (CoCoA)](svm.html)
+* [Multiple linear regression](multiple_linear_regression.html)
+* [Optimization Framework](optimization.html)
+
+### Unsupervised Learning
+
+* [k-Nearest neighbors join](knn.html)
+
+### Data Preprocessing
+
+* [Polynomial Features](polynomial_features.html)
+* [Standard Scaler](standard_scaler.html)
+* [MinMax Scaler](min_max_scaler.html)
+
+### Recommendation
+
+* [Alternating Least Squares (ALS)](als.html)
+
+### Utilities
+
+* [Distance Metrics](distance_metrics.html)
+* [Cross Validation](cross_validation.html)
+
+## Getting Started
+
+See our [quickstart guide](quickstart.html) for a comprehensive getting-started
+example.
+
+If you want to jump right in, first [set up a Flink program]({{ site.baseurl }}/dev/api_concepts.html#linking-with-flink).
+Next, add the FlinkML dependency to the `pom.xml` of your project:
+
+{% highlight xml %}
+<dependency>
+  <groupId>org.apache.flink</groupId>
+  <artifactId>flink-ml{{ site.scala_version_suffix }}</artifactId>
+  <version>{{site.version }}</version>
+</dependency>
+{% endhighlight %}
+
+Note that FlinkML is currently not part of the binary distribution.
+For cluster execution, see the instructions on [linking with modules not contained in the binary distribution]({{site.baseurl}}/dev/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution).
+
+Now you can start solving your analysis task.
+The following code snippet shows how to train a multiple linear regression model.
+
+{% highlight scala %}
+import org.apache.flink.api.scala._
+import org.apache.flink.ml.common.LabeledVector
+import org.apache.flink.ml.math.Vector
+import org.apache.flink.ml.preprocessing.Splitter
+import org.apache.flink.ml.regression.MultipleLinearRegression
+
+// LabeledVector is a feature vector with a label (class or real value)
+val dataSet: DataSet[LabeledVector] = ...
+
+// A Splitter breaks up a DataSet into training and testing data.
+// Alternatively, provide trainingData and testingData directly.
+val trainTestData = Splitter.trainTestSplit(dataSet)
+val trainingData: DataSet[LabeledVector] = trainTestData.training
+val testingData: DataSet[Vector] = trainTestData.testing.map(lv => lv.vector)
+
+val mlr = MultipleLinearRegression()
+  .setStepsize(1.0)
+  .setIterations(100)
+  .setConvergenceThreshold(0.001)
+
+mlr.fit(trainingData)
+
+// The fitted model can now be used to make predictions
+val predictions: DataSet[LabeledVector] = mlr.predict(testingData)
+{% endhighlight %}
+
+## Pipelines
+
+A key concept of FlinkML is its [scikit-learn](http://scikit-learn.org)-inspired pipelining mechanism.
+It allows you to quickly build the kind of complex data analysis pipelines that appear in every data scientist's daily work.
+An in-depth description of FlinkML's pipelines and their internal workings can be found [here](pipelines.html).
+
+The following example code shows how to set up an analysis pipeline with FlinkML.
+
+{% highlight scala %}
+val trainingData: DataSet[LabeledVector] = ...
+val testingData: DataSet[Vector] = ...
+
+val scaler = StandardScaler()
+val polyFeatures = PolynomialFeatures().setDegree(3)
+val mlr = MultipleLinearRegression()
+
+// Construct pipeline of standard scaler, polynomial features and multiple linear regression
+val pipeline = scaler.chainTransformer(polyFeatures).chainPredictor(mlr)
+
+// Train pipeline
+pipeline.fit(trainingData)
+
+// Calculate predictions
+val predictions: DataSet[LabeledVector] = pipeline.predict(testingData)
+{% endhighlight %}
+
+One can chain a `Transformer` to another `Transformer` or a set of chained `Transformers` by calling the method `chainTransformer`.
+If one wants to chain a `Predictor` to a `Transformer` or a set of chained `Transformers`, one has to call the method `chainPredictor`.
+
+
+## How to contribute
+
+The Flink community welcomes all contributors who want to get involved in the development of Flink and its libraries.
+To get started quickly with contributing to FlinkML, please read our official
+[contribution guide]({{site.baseurl}}/dev/libs/ml/contribution_guide.html).
http://git-wip-us.apache.org/repos/asf/flink/blob/844c874b/docs/dev/libs/ml/knn.md

diff --git a/docs/dev/libs/ml/knn.md b/docs/dev/libs/ml/knn.md
new file mode 100644
index 0000000..0d3ca9a
--- /dev/null
+++ b/docs/dev/libs/ml/knn.md
@@ -0,0 +1,144 @@
+---
+mathjax: include
+title: k-Nearest Neighbors Join
+nav-parent_id: ml
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+Implements an exact k-nearest neighbors join algorithm. Given a training set $A$ and a testing set $B$, the algorithm returns
+
+$$
+KNNJ(A, B, k) = \{ \left( b, KNN(b, A, k) \right) \text{ where } b \in B \text{ and } KNN(b, A, k) \text{ are the k-nearest points to }b\text{ in }A \}
+$$
+
+The brute-force approach is to compute the distance between every training and testing point. To reduce the number of distance computations, a quadtree can be used to partition the training set. The quadtree scales well in the number of training points, though poorly in the spatial dimension. The algorithm automatically chooses whether or not to use the quadtree, though the user can override that decision by setting a parameter that forces or prohibits its use.
+
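+The brute-force variant described above can be sketched on plain Scala collections. This is a minimal illustration under stated assumptions, not FlinkML's implementation: `BruteForceKnnJoin`, `Point`, `squaredDistance`, and `knnJoin` are hypothetical names, `Vector` here is Scala's immutable `Vector` rather than FlinkML's, and neither a quadtree nor blocking is used.
+
+```scala
+// Brute-force k-nearest-neighbors join on local collections (illustrative only).
+object BruteForceKnnJoin {
+  type Point = Vector[Double]
+
+  // Squared Euclidean distance; it ranks points in the same order as the Euclidean distance.
+  def squaredDistance(a: Point, b: Point): Double =
+    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
+
+  // For every test point b, return b together with its k closest training points.
+  def knnJoin(training: Seq[Point], testing: Seq[Point], k: Int): Seq[(Point, Seq[Point])] =
+    testing.map { b =>
+      (b, training.sortBy(a => squaredDistance(a, b)).take(k))
+    }
+
+  def main(args: Array[String]): Unit = {
+    val training = Seq(Vector(0.0, 0.0), Vector(1.0, 1.0), Vector(5.0, 5.0))
+    val testing = Seq(Vector(0.9, 0.9))
+    // The two nearest training points of (0.9, 0.9) are (1.0, 1.0) and (0.0, 0.0).
+    knnJoin(training, testing, k = 2).foreach(println)
+  }
+}
+```
+
+Every test point is compared against every training point here, which is the cost that a spatial index such as the quadtree tries to reduce.
+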
+## Operations
+
+`KNN` is a `Predictor`.
+As such, it supports the `fit` and `predict` operations.
+
+### Fit
+
+KNN is trained on a given set of `Vector`:
+
+* `fit[T <: Vector]: DataSet[T] => Unit`
+
+### Predict
+
+For every subtype of FlinkML's `Vector`, KNN returns each test point together with its K nearest training points:
+
+* `predict[T <: Vector]: DataSet[T] => DataSet[(T, Array[Vector])]`, where the `(T, Array[Vector])` tuple
+  corresponds to (test point, K-nearest training points)
+
+## Parameters
+
+The KNN implementation can be controlled by the following parameters:
+
+ <table class="table table-bordered">
+ <thead>
+ <tr>
+ <th class="text-left" style="width: 20%">Parameters</th>
+ <th class="text-center">Description</th>
+ </tr>
+ </thead>
+
+ <tbody>
+ <tr>
+ <td><strong>K</strong></td>
+ <td>
+ <p>
+ Defines the number of nearest neighbors to search for. That is, for each test point, the algorithm finds the K nearest neighbors in the training set.
+ (Default value: <strong>5</strong>)
+ </p>
+ </td>
+ </tr>
+ <tr>
+ <td><strong>DistanceMetric</strong></td>
+ <td>
+ <p>
+ Sets the distance metric used to calculate the distance between two points. If no metric is specified, then <code>org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric</code> is used.
+ (Default value: <strong>EuclideanDistanceMetric</strong>)
+ </p>
+ </td>
+ </tr>
+ <tr>
+ <td><strong>Blocks</strong></td>
+ <td>
+ <p>
+ Sets the number of blocks into which the input data will be split. This number should be set
+ at least to the degree of parallelism. If no value is specified, then the parallelism of the
+ input <code>DataSet</code> is used as the number of blocks.
+ (Default value: <strong>None</strong>)
+ </p>
+ </td>
+ </tr>
+ <tr>
+ <td><strong>UseQuadTree</strong></td>
+ <td>
+ <p>
+ A boolean that specifies whether or not to use a quadtree to partition the training set to potentially simplify the KNN search. If no value is specified, the code will automatically decide whether or not to use a quadtree. Use of a quadtree scales well with the number of training and testing points, though poorly with the dimension.
+ (Default value: <strong>None</strong>)
+ </p>
+ </td>
+ </tr>
+ <tr>
+ <td><strong>SizeHint</strong></td>
+ <td>
+ <p>Specifies whether the training set or the test set is small, in order to optimize the cross product operation needed for the KNN search. Set this to <code>CrossHint.FIRST_IS_SMALL</code> if the training set is small, and to <code>CrossHint.SECOND_IS_SMALL</code> if the test set is small.
+ (Default value: <strong>None</strong>)
+ </p>
+ </td>
+ </tr>
+ </tbody>
+ </table>
+
+## Examples
+
+{% highlight scala %}
+import org.apache.flink.api.common.operators.base.CrossOperatorBase.CrossHint
+import org.apache.flink.api.scala._
+import org.apache.flink.ml.nn.KNN
+import org.apache.flink.ml.math.Vector
+import org.apache.flink.ml.metrics.distances.SquaredEuclideanDistanceMetric
+
+val env = ExecutionEnvironment.getExecutionEnvironment
+
+// prepare data
+val trainingSet: DataSet[Vector] = ...
+val testingSet: DataSet[Vector] = ...
+
+val knn = KNN()
+ .setK(3)
+ .setBlocks(10)
+ .setDistanceMetric(SquaredEuclideanDistanceMetric())
+ .setUseQuadTree(false)
+ .setSizeHint(CrossHint.SECOND_IS_SMALL)
+
+// run knn join
+knn.fit(trainingSet)
+val result = knn.predict(testingSet).collect()
+{% endhighlight %}
+
+For more details on computing KNN with and without a quadtree, see this presentation: [http://danielblazevski.github.io/](http://danielblazevski.github.io/)
http://git-wip-us.apache.org/repos/asf/flink/blob/844c874b/docs/dev/libs/ml/min_max_scaler.md

diff --git a/docs/dev/libs/ml/min_max_scaler.md b/docs/dev/libs/ml/min_max_scaler.md
new file mode 100644
index 0000000..35376c3
--- /dev/null
+++ b/docs/dev/libs/ml/min_max_scaler.md
@@ -0,0 +1,112 @@
+---
+mathjax: include
+title: MinMax Scaler
+nav-parent_id: ml
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+
+ The MinMax scaler scales the given data set so that all values lie within a user-specified range [min,max].
+ If the user does not provide a minimum and maximum value for the scaling range, the MinMax scaler transforms the features of the input data set to lie in the [0,1] interval.
+ Given a set of input data $x_1, x_2,... x_n$, with minimum value:
+
+ $$x_{min} = \min(\{x_1, x_2,..., x_n\})$$
+
+ and maximum value:
+
+ $$x_{max} = \max(\{x_1, x_2,..., x_n\})$$
+
+The scaled data set $z_1, z_2,...,z_n$ will be:
+
+ $$z_{i}= \frac{x_{i} - x_{min}}{x_{max} - x_{min}} \left ( max - min \right ) + min$$
+
+where $\textit{min}$ and $\textit{max}$ are the user-specified minimum and maximum values of the range to scale.
+
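+The scaling formula can be sketched directly in plain Scala. This is an illustrative sketch, not the `MinMaxScaler` implementation: `MinMaxSketch` and `scale` are hypothetical names, the input is a local `Seq[Double]` holding a single feature, and rejecting a constant feature with `require` is an assumption made here to avoid dividing by zero.
+
+```scala
+object MinMaxSketch {
+  // Rescale values linearly so that the smallest maps to `min` and the largest to `max`.
+  def scale(xs: Seq[Double], min: Double = 0.0, max: Double = 1.0): Seq[Double] = {
+    val xMin = xs.min
+    val xMax = xs.max
+    // Assumption of this sketch: a constant feature is rejected rather than mapped to a fixed value.
+    require(xMax > xMin, "all values are equal; the formula would divide by zero")
+    xs.map(x => (x - xMin) / (xMax - xMin) * (max - min) + min)
+  }
+}
+```
+
+For example, `MinMaxSketch.scale(Seq(2.0, 4.0, 10.0), -1.0, 1.0)` maps the values to -1.0, -0.5 and 1.0, since $x_{min} = 2$ and $x_{max} = 10$.
+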
+## Operations
+
+`MinMaxScaler` is a `Transformer`.
+As such, it supports the `fit` and `transform` operations.
+
+### Fit
+
+MinMaxScaler is trained on all subtypes of `Vector` or `LabeledVector`:
+
+* `fit[T <: Vector]: DataSet[T] => Unit`
+* `fit: DataSet[LabeledVector] => Unit`
+
+### Transform
+
+MinMaxScaler transforms all subtypes of `Vector` or `LabeledVector` into the respective type:
+
+* `transform[T <: Vector]: DataSet[T] => DataSet[T]`
+* `transform: DataSet[LabeledVector] => DataSet[LabeledVector]`
+
+## Parameters
+
+The MinMax scaler implementation can be controlled by the following two parameters:
+
+ <table class="table table-bordered">
+ <thead>
+ <tr>
+ <th class="text-left" style="width: 20%">Parameters</th>
+ <th class="text-center">Description</th>
+ </tr>
+ </thead>
+
+ <tbody>
+ <tr>
+ <td><strong>Min</strong></td>
+ <td>
+ <p>
+ The minimum value of the range for the scaled data set. (Default value: <strong>0.0</strong>)
+ </p>
+ </td>
+ </tr>
+ <tr>
+ <td><strong>Max</strong></td>
+ <td>
+ <p>
+ The maximum value of the range for the scaled data set. (Default value: <strong>1.0</strong>)
+ </p>
+ </td>
+ </tr>
+ </tbody>
+</table>
+
+## Examples
+
+{% highlight scala %}
+// Create MinMax scaler transformer
+val minMaxscaler = MinMaxScaler()
+  .setMin(-1.0)
+
+// Obtain data set to be scaled
+val dataSet: DataSet[Vector] = ...
+
+// Learn the minimum and maximum values of the training data
+minMaxscaler.fit(dataSet)
+
+// Scale the provided data set to have min=-1.0 and max=1.0
+val scaledDS = minMaxscaler.transform(dataSet)
+{% endhighlight %}
http://git-wip-us.apache.org/repos/asf/flink/blob/844c874b/docs/dev/libs/ml/multiple_linear_regression.md

diff --git a/docs/dev/libs/ml/multiple_linear_regression.md b/docs/dev/libs/ml/multiple_linear_regression.md
new file mode 100644
index 0000000..95ee85f
--- /dev/null
+++ b/docs/dev/libs/ml/multiple_linear_regression.md
@@ -0,0 +1,160 @@
+---
+mathjax: include
+title: Multiple Linear Regression
+nav-parent_id: ml
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+
+ Multiple linear regression tries to find a linear function which best fits the provided input data.
+ Given a set of input data points with their values $(\mathbf{x}_i, y_i)$, multiple linear regression finds
+ a vector $\mathbf{w}$ such that the sum of the squared residuals is minimized:
+
+ $$ S(\mathbf{w}) = \sum_{i=1}^n \left(y_i - \mathbf{w}^T\mathbf{x}_i \right)^2$$
+
+ Written in matrix notation, we obtain the following formulation:
+
+ $$\mathbf{w}^* = \arg \min_{\mathbf{w}} \lVert \mathbf{y} - X\mathbf{w} \rVert_2^2$$
+
+ This problem has a closed form solution which is given by:
+
+ $$\mathbf{w}^* = \left(X^TX\right)^{-1}X^T\mathbf{y}$$
+
+ However, in cases where the input data set is so huge that a complete pass over the whole data
+ set is prohibitive, one can apply stochastic gradient descent (SGD) to approximate the solution.
+ SGD first calculates the gradients for a random subset of the input data set. The gradient
+ for a given point $\mathbf{x}_i$ is given by:
+
+ $$\nabla_{\mathbf{w}} S(\mathbf{w}, \mathbf{x}_i) = 2\left(\mathbf{w}^T\mathbf{x}_i -
+ y_i\right)\mathbf{x}_i$$
+
+ The gradients are averaged and scaled. The scaling is defined by $\gamma = \frac{s}{\sqrt{j}}$
+ with $s$ being the initial step size and $j$ being the current iteration number. The resulting gradient is subtracted from the
+ current weight vector giving the new weight vector for the next iteration:
+
+ $$\mathbf{w}_{j+1} = \mathbf{w}_j - \gamma \frac{1}{n}\sum_{i=1}^n \nabla_{\mathbf{w}} S(\mathbf{w}_j, \mathbf{x}_i)$$
+
+ The multiple linear regression algorithm either computes a fixed number of SGD iterations or terminates based on a dynamic convergence criterion.
+ The convergence criterion is the relative change in the sum of squared residuals, where $S_k$ is the sum after iteration $k$ and $\rho$ is the convergence threshold:
+
+ $$\frac{S_{k-1} - S_k}{S_{k-1}} < \rho$$
+
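+The update rule and the convergence criterion can be sketched on local collections. This is a minimal sketch under stated assumptions, not FlinkML's implementation: `MlrGradientDescentSketch`, `Example`, `dot`, `squaredResiduals`, and `fit` are hypothetical names, the gradient is averaged over the full data set instead of a random subset, and no intercept term is fitted.
+
+```scala
+object MlrGradientDescentSketch {
+  // One labeled example: feature vector x and target value y.
+  final case class Example(x: Array[Double], y: Double)
+
+  def dot(a: Array[Double], b: Array[Double]): Double =
+    a.zip(b).map { case (u, v) => u * v }.sum
+
+  // Sum of squared residuals S(w).
+  def squaredResiduals(w: Array[Double], data: Seq[Example]): Double =
+    data.map(e => math.pow(e.y - dot(w, e.x), 2)).sum
+
+  def fit(data: Seq[Example], stepsize: Double, iterations: Int, threshold: Double): Array[Double] = {
+    var w = Array.fill(data.head.x.length)(0.0)
+    var previous = squaredResiduals(w, data)
+    var j = 1
+    var converged = false
+    while (j <= iterations && !converged) {
+      // Average gradient: (1/n) * sum_i 2 (w^T x_i - y_i) x_i
+      val gradient = data
+        .map { e =>
+          val residual = dot(w, e.x) - e.y
+          e.x.map(xi => 2.0 * residual * xi)
+        }
+        .reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
+        .map(_ / data.size)
+      // Step size decays as s / sqrt(j).
+      val gamma = stepsize / math.sqrt(j.toDouble)
+      w = w.zip(gradient).map { case (wi, gi) => wi - gamma * gi }
+      // Stop once the relative change of S drops below the threshold.
+      val current = squaredResiduals(w, data)
+      converged = previous > 0.0 && (previous - current) / previous < threshold
+      previous = current
+      j += 1
+    }
+    w
+  }
+}
+```
+
+With the points (1, 2), (2, 4), (3, 6) and a step size of 0.1, the fitted weight should approach 2. Note that a step size that is too large makes $S$ grow, in which case the criterion as written also stops the loop.
+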
+## Operations
+
+`MultipleLinearRegression` is a `Predictor`.
+As such, it supports the `fit` and `predict` operations.
+
+### Fit
+
+MultipleLinearRegression is trained on a set of `LabeledVector`:
+
+* `fit: DataSet[LabeledVector] => Unit`
+
+### Predict
+
+MultipleLinearRegression predicts for all subtypes of `Vector` the corresponding regression value:
+
+* `predict[T <: Vector]: DataSet[T] => DataSet[LabeledVector]`
+
+If we call predict with a `DataSet[LabeledVector]`, we make a prediction on the regression value
+for each example, and return a `DataSet[(Double, Double)]`. In each tuple the first element
+is the true value, as was provided from the input `DataSet[LabeledVector]` and the second element
+is the predicted value. You can then use these `(truth, prediction)` tuples to evaluate
+the algorithm's performance.
+
+* `predict: DataSet[LabeledVector] => DataSet[(Double, Double)]`
+
+## Parameters
+
+ The multiple linear regression implementation can be controlled by the following parameters:
+
+ <table class="table table-bordered">
+ <thead>
+ <tr>
+ <th class="text-left" style="width: 20%">Parameters</th>
+ <th class="text-center">Description</th>
+ </tr>
+ </thead>
+
+ <tbody>
+ <tr>
+ <td><strong>Iterations</strong></td>
+ <td>
+ <p>
+ The maximum number of iterations. (Default value: <strong>10</strong>)
+ </p>
+ </td>
+ </tr>
+ <tr>
+ <td><strong>Stepsize</strong></td>
+ <td>
+ <p>
+ Initial step size for the gradient descent method.
+ This value controls how far the gradient descent method moves in the opposite direction of the gradient.
+ Tuning this parameter can be crucial for keeping the method stable and for obtaining good performance.
+ (Default value: <strong>0.1</strong>)
+ </p>
+ </td>
+ </tr>
+ <tr>
+ <td><strong>ConvergenceThreshold</strong></td>
+ <td>
+ <p>
+ Threshold for relative change of the sum of squared residuals until the iteration is stopped.
+ (Default value: <strong>None</strong>)
+ </p>
+ </td>
+ </tr>
+ <tr>
+ <td><strong>LearningRateMethod</strong></td>
+ <td>
+ <p>
+ Learning rate method used to calculate the effective learning rate for each iteration.
+ See the list of supported <a href="optimization.html">learning rate methods</a>.
+ (Default value: <strong>LearningRateMethod.Default</strong>)
+ </p>
+ </td>
+ </tr>
+ </tbody>
+ </table>
+
+## Examples
+
+{% highlight scala %}
+// Create multiple linear regression learner
+val mlr = MultipleLinearRegression()
+  .setIterations(10)
+  .setStepsize(0.5)
+  .setConvergenceThreshold(0.001)
+
+// Obtain training and testing data set
+val trainingDS: DataSet[LabeledVector] = ...
+val testingDS: DataSet[Vector] = ...
+
+// Fit the linear model to the provided data
+mlr.fit(trainingDS)
+
+// Calculate the predictions for the test data
+val predictions = mlr.predict(testingDS)
+{% endhighlight %}
http://git-wip-us.apache.org/repos/asf/flink/blob/844c874b/docs/dev/libs/ml/optimization.md

diff --git a/docs/dev/libs/ml/optimization.md b/docs/dev/libs/ml/optimization.md
new file mode 100644
index 0000000..e3e2f63
--- /dev/null
+++ b/docs/dev/libs/ml/optimization.md
@@ -0,0 +1,382 @@
+---
+mathjax: include
+title: Optimization
+nav-parent_id: ml
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* Table of contents
+{:toc}
+
+## Mathematical Formulation
+
+The optimization framework in FlinkML is a developer-oriented package that can be used to solve
+[optimization](https://en.wikipedia.org/wiki/Mathematical_optimization)
+problems common in Machine Learning (ML) tasks. In the supervised learning context, this usually
+involves finding a model, as defined by a set of parameters $\wv$, that minimizes a function $f(\wv)$
+given a set of $(\x, y)$ examples,
+where $\x$ is a feature vector and $y$ is a real number, which can represent either a real value in
+the regression case, or a class label in the classification case. In supervised learning, the
+function to be minimized is usually of the form:
+
+
+\begin{equation} \label{eq:objectiveFunc}
+ f(\wv) :=
+ \frac1n \sum_{i=1}^n L(\wv;\x_i,y_i) +
+ \lambda\, R(\wv)
+ \ .
+\end{equation}
+
+
+where $L$ is the loss function and $R(\wv)$ the regularization penalty. We use $L$ to measure how
+well the model fits the observed data, and we use $R$ in order to impose a complexity cost to the
+model, with $\lambda > 0$ being the regularization parameter.
+
+### Loss Functions
+
+In supervised learning, we use loss functions in order to measure the model fit, by
+penalizing errors in the predictions $p$ made by the model compared to the true $y$ for each
+example. Different loss functions can be used for regression (e.g. Squared Loss) and classification
+(e.g. Hinge Loss) tasks.
+
+Some common loss functions are:
+
+* Squared Loss: $ \frac{1}{2} \left(\wv^T \cdot \x - y\right)^2, \quad y \in \R $
+* Hinge Loss: $ \max \left(0, 1 - y ~ \wv^T \cdot \x\right), \quad y \in \{-1, +1\} $
+* Logistic Loss: $ \log\left(1+\exp\left( -y ~ \wv^T \cdot \x\right)\right), \quad y \in \{-1, +1\}$
+
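+The three loss functions listed above can be written as plain Scala functions of the raw linear prediction $p = \wv^T \cdot \x$. This is an illustrative sketch only: `LossSketch` and its methods are hypothetical names and are not the FlinkML loss classes.
+
+```scala
+object LossSketch {
+  // p is the raw linear prediction w^T x; y is the true value or label.
+  def squaredLoss(p: Double, y: Double): Double = 0.5 * (p - y) * (p - y)
+
+  // For the two classification losses, y is expected to be -1 or +1.
+  def hingeLoss(p: Double, y: Double): Double = math.max(0.0, 1.0 - y * p)
+
+  def logisticLoss(p: Double, y: Double): Double = math.log(1.0 + math.exp(-y * p))
+}
+```
+
+For a correctly classified example with margin $y \cdot p \geq 1$ the hinge loss is exactly zero, whereas the logistic loss stays positive and only approaches zero as the margin grows.
+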
+### Regularization Types
+
+[Regularization](https://en.wikipedia.org/wiki/Regularization_(mathematics)) in machine learning
+imposes penalties to the estimated models, in order to reduce overfitting. The most common penalties
+are the $L_1$ and $L_2$ penalties, defined as:
+
+* $L_1$: $R(\wv) = \norm{\wv}_1$
+* $L_2$: $R(\wv) = \frac{1}{2}\norm{\wv}_2^2$
+
+The $L_2$ penalty penalizes large weights, favoring solutions with many small weights over those with
+a few large ones.
+The $L_1$ penalty can be used to drive a number of the solution coefficients to 0, thereby
+producing sparse solutions.
+The regularization constant $\lambda$ in $\eqref{eq:objectiveFunc}$ determines the amount of regularization applied to the model,
+and is usually determined through model cross-validation.
+A good comparison of regularization types can be found in [this](http://www.robotics.stanford.edu/~ang/papers/icml04-l1l2.pdf) paper by Andrew Ng.
+The supported regularization types depend on the optimization algorithm used.
+
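+The two penalties defined in this section are simple to compute. The following sketch uses hypothetical names (`RegularizationSketch`, `l1`, `l2`) and a plain `Seq[Double]` for the weights; it is not the FlinkML API.
+
+```scala
+object RegularizationSketch {
+  // L1 penalty: sum of the absolute values of the weights.
+  def l1(w: Seq[Double]): Double = w.map(x => math.abs(x)).sum
+
+  // L2 penalty: half of the squared Euclidean norm of the weights.
+  def l2(w: Seq[Double]): Double = 0.5 * w.map(x => x * x).sum
+}
+```
+
+For the weights $(3, -4)$ this gives an $L_1$ penalty of 7 and an $L_2$ penalty of 12.5.
+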
+## Stochastic Gradient Descent
+
+In order to find a (local) minimum of a function, Gradient Descent methods take steps in the
+direction opposite to the gradient of the function $\eqref{eq:objectiveFunc}$ taken with
+respect to the current parameters (weights).
+In order to compute the exact gradient we need to perform one pass through all the points in
+a dataset, making the process computationally expensive.
+An alternative is Stochastic Gradient Descent (SGD) where at each iteration we sample one point
+from the complete dataset and update the parameters for each point, in an online manner.
+
+In mini-batch SGD we instead sample random subsets of the dataset, and compute the gradient
+over each batch. At each iteration of the algorithm we update the weights once, based on
+the average of the gradients computed from each mini-batch.
+
+An important parameter is the learning rate $\eta$, or step size, which can be determined by one of five methods, listed below. The setting of the initial step size can significantly affect the performance of the
+algorithm. For some practical tips on tuning SGD see Leon Bottou's
+"[Stochastic Gradient Descent Tricks](http://research.microsoft.com/pubs/192769/tricks-2012.pdf)".
+
+The current implementation of SGD uses the whole partition, making it
+effectively a batch gradient descent. Once a sampling operator has been introduced in Flink, true
+mini-batch SGD will be performed.
+
+### Regularization
+
+FlinkML supports Stochastic Gradient Descent with L1, L2 and no regularization.
+The following table maps the implementing classes to their regularization functions.
+
+<table class="table table-bordered">
+ <thead>
+ <tr>
+ <th class="text-left" style="width: 20%">Class Name</th>
+ <th class="text-center">Regularization function $R(\wv)$</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td><code>SimpleGradient</code></td>
+ <td>$R(\wv) = 0$</td>
+ </tr>
+ <tr>
+ <td><code>GradientDescentL1</code></td>
+ <td>$R(\wv) = \norm{\wv}_1$</td>
+ </tr>
+ <tr>
+ <td><code>GradientDescentL2</code></td>
+ <td>$R(\wv) = \frac{1}{2}\norm{\wv}_2^2$</td>
+ </tr>
+ </tbody>
+</table>
+
+### Parameters
+
+ The stochastic gradient descent implementation can be controlled by the following parameters:
+
+ <table class="table table-bordered">
+ <thead>
+ <tr>
+ <th class="text-left" style="width: 20%">Parameter</th>
+ <th class="text-center">Description</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td><strong>LossFunction</strong></td>
+ <td>
+ <p>
+ The loss function to be optimized. (Default value: <strong>None</strong>)
+ </p>
+ </td>
+ </tr>
+ <tr>
+ <td><strong>RegularizationConstant</strong></td>
+ <td>
+ <p>
+ The amount of regularization to apply. (Default value: <strong>0.1</strong>)
+ </p>
+ </td>
+ </tr>
+ <tr>
+ <td><strong>Iterations</strong></td>
+ <td>
+ <p>
+ The maximum number of iterations. (Default value: <strong>10</strong>)
+ </p>
+ </td>
+ </tr>
+ <tr>
+ <td><strong>LearningRate</strong></td>
+ <td>
+ <p>
+ Initial learning rate for the gradient descent method.
+ This value controls how far the gradient descent method moves in the opposite direction
+ of the gradient.
+ (Default value: <strong>0.1</strong>)
+ </p>
+ </td>
+ </tr>
+ <tr>
+ <td><strong>ConvergenceThreshold</strong></td>
+ <td>
+ <p>
+ When set, iterations stop if the relative change in the value of the objective function $\eqref{eq:objectiveFunc}$ is less than the provided threshold, $\tau$.
+ The convergence criterion is defined as follows: $\left| \frac{f(\wv)_{i-1} - f(\wv)_i}{f(\wv)_{i-1}}\right| < \tau$.
+ (Default value: <strong>None</strong>)
+ </p>
+ </td>
+ </tr>
+ <tr>
+ <td><strong>LearningRateMethod</strong></td>
+ <td>
+ <p>
+ Learning rate method used to calculate the effective learning rate for each iteration.
+ (Default value: <strong>LearningRateMethod.Default</strong>)
+ </p>
+ </td>
+ </tr>
+ <tr>
+ <td><strong>Decay</strong></td>
+ <td>
+ <p>
+ (Default value: <strong>0.0</strong>)
+ </p>
+ </td>
+ </tr>
+ </tbody>
+ </table>
+
+### Loss Function
+
+The loss function to be minimized has to implement the `LossFunction` interface, which defines methods to compute the loss and its gradient.
+You can either define your own `LossFunction` or use the `GenericLossFunction` class, which constructs the loss function from an outer loss function and a prediction function.
+For example:
+
+```Scala
+val lossFunction = GenericLossFunction(SquaredLoss, LinearPrediction)
+```
+
+The full list of supported outer loss functions can be found [here](#partial-loss-function-values).
+The full list of supported prediction functions can be found [here](#prediction-function-values).
+
+#### Partial Loss Function Values ##
+
+ <table class="table table-bordered">
+ <thead>
+ <tr>
+ <th class="text-left" style="width: 20%">Function Name</th>
+ <th class="text-center">Description</th>
+ <th class="text-center">Loss</th>
+ <th class="text-center">Loss Derivative</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td><strong>SquaredLoss</strong></td>
+ <td>
+ <p>
+ Loss function most commonly used for regression tasks.
+ </p>
+ </td>
+ <td class="text-center">$\frac{1}{2} (\wv^T \cdot \x - y)^2$</td>
+ <td class="text-center">$\wv^T \cdot \x - y$</td>
+ </tr>
+ </tbody>
+ </table>
+
+#### Prediction Function Values ##
+
+ <table class="table table-bordered">
+ <thead>
+ <tr>
+ <th class="text-left" style="width: 20%">Function Name</th>
+ <th class="text-center">Description</th>
+ <th class="text-center">Prediction</th>
+ <th class="text-center">Prediction Gradient</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td><strong>LinearPrediction</strong></td>
+ <td>
+ <p>
+ The function most commonly used for linear models, such as linear regression and
+ linear classifiers.
+ </p>
+ </td>
+ <td class="text-center">$\x^T \cdot \wv$</td>
+ <td class="text-center">$\x$</td>
+ </tr>
+ </tbody>
+ </table>
+
+#### Effective Learning Rate ##
+
+Where:
+
+ $j$ is the iteration number
+
+ $\eta_j$ is the step size on step $j$
+
+ $\eta_0$ is the initial step size
+
+ $\lambda$ is the regularization constant
+
+ $\tau$ is the decay constant, which causes the learning rate to be a decreasing function of $j$, that is to say as iterations increase, learning rate decreases. The exact rate of decay is function specific, see **Inverse Scaling** and **Wei Xu's Method** (which is an extension of the **Inverse Scaling** method).
+
+<table class="table tablebordered">
+ <thead>
+ <tr>
+ <th class="textleft" style="width: 20%">Function Name</th>
+ <th class="textcenter">Description</th>
+ <th class="textcenter">Function</th>
+ <th class="textcenter">Called As</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td><strong>Default</strong></td>
+ <td>
+ <p>
+ The function default method used for determining the step size. This is equivalent to the inverse scaling method for $\tau$ = 0.5. This special case is kept as the default to maintain backwards compatibility.
+ </p>
+ </td>
+ <td class="textcenter">$\eta_j = \eta_0/\sqrt{j}$</td>
+ <td class="textcenter"><code>LearningRateMethod.Default</code></td>
+ </tr>
+ <tr>
+ <td><strong>Constant</strong></td>
+ <td>
+ <p>
+ The step size is constant throughout the learning task.
+ </p>
+ </td>
+ <td class="textcenter">$\eta_j = \eta_0$</td>
+ <td class="textcenter"><code>LearningRateMethod.Constant</code></td>
+ </tr>
+ <tr>
+ <td><strong>Leon Bottou's Method</strong></td>
+ <td>
+ <p>
+ This is the <code>'optimal'</code> method of sklearn.
+ The optimal initial value $t_0$ has to be provided.
+ Sklearn uses the following heuristic: $t_0 = \max(1.0, L^\prime(\beta, 1.0) / (\alpha \cdot \beta)$
+ with $\beta = \sqrt{\frac{1}{\sqrt{\alpha}}}$ and $L^\prime(prediction, truth)$ being the derivative of the loss function.
+ </p>
+ </td>
+ <td class="textcenter">$\eta_j = 1 / (\lambda \cdot (t_0 + j - 1)) $</td>
+ <td class="textcenter"><code>LearningRateMethod.Bottou</code></td>
+ </tr>
+ <tr>
+ <td><strong>Inverse Scaling</strong></td>
+ <td>
+ <p>
+ A very common method for determining the step size.
+ </p>
+ </td>
+ <td class="textcenter">$\eta_j = \eta_0 / j^{\tau}$</td>
+ <td class="textcenter"><code>LearningRateMethod.InvScaling</code></td>
+ </tr>
+ <tr>
+ <td><strong>Wei Xu's Method</strong></td>
+ <td>
+ <p>
+ Method proposed by Wei Xu in <a href="http://arxiv.org/pdf/1107.2490.pdf">Towards Optimal One Pass Large Scale Learning with
+ Averaged Stochastic Gradient Descent</a>
+ </p>
+ </td>
+ <td class="textcenter">$\eta_j = \eta_0 \cdot (1+ \lambda \cdot \eta_0 \cdot j)^{-\tau} $</td>
+ <td class="textcenter"><code>LearningRateMethod.Xu</code></td>
+ </tr>
+ </tbody>
+ </table>
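+
+To make the formulas above concrete, the following is a minimal plain-Scala sketch of the step size
+schedules. It does not use FlinkML: the `StepSizes` object and its method names are hypothetical
+and exist only for illustration. It assumes that Bottou's method divides by
+$\lambda \cdot (t_0 + j - 1)$ and that Wei Xu's method uses the exponent $-\tau$, so that both
+step sizes shrink as $j$ grows.
+
+{% highlight scala %}
+// Hypothetical helper, not part of FlinkML: evaluates the step size formulas directly.
+object StepSizes {
+  // Default: eta_j = eta_0 / sqrt(j)
+  def defaultRate(eta0: Double, j: Int): Double = eta0 / math.sqrt(j)
+
+  // Constant: eta_j = eta_0
+  def constant(eta0: Double, j: Int): Double = eta0
+
+  // Bottou: eta_j = 1 / (lambda * (t_0 + j - 1))
+  def bottou(lambda: Double, t0: Double, j: Int): Double =
+    1.0 / (lambda * (t0 + j - 1))
+
+  // Inverse scaling: eta_j = eta_0 / j^tau
+  def invScaling(eta0: Double, tau: Double, j: Int): Double =
+    eta0 / math.pow(j, tau)
+
+  // Wei Xu: eta_j = eta_0 * (1 + lambda * eta_0 * j)^(-tau)
+  def xu(eta0: Double, lambda: Double, tau: Double, j: Int): Double =
+    eta0 * math.pow(1 + lambda * eta0 * j, -tau)
+}
+{% endhighlight %}
+
+With $\eta_0 = 0.01$ and $j = 4$, the default method and inverse scaling with $\tau = 0.5$ both
+give a step size of $0.005$, which shows that the default is a special case of inverse scaling.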
+
+### Examples
+
+In the Flink implementation of SGD, given a set of examples in a `DataSet[LabeledVector]` and
+optionally some initial weights, we can use `GradientDescentL1.optimize()` in order to optimize
+the weights for the given data.
+
+The user can provide an initial `DataSet[WeightVector]`,
+which contains one `WeightVector` element, or use the default weights which are all set to 0.
+A `WeightVector` is a container class for the weights, which separates the intercept from the
+weight vector. This allows us to avoid applying regularization to the intercept.
+
+
+
+{% highlight scala %}
+// Create stochastic gradient descent solver
+val sgd = GradientDescentL1()
+ .setLossFunction(SquaredLoss())
+ .setRegularizationConstant(0.2)
+ .setIterations(100)
+ .setLearningRate(0.01)
+ .setLearningRateMethod(LearningRateMethod.Xu(0.75))
+
+
+// Obtain data
+val trainingDS: DataSet[LabeledVector] = ...
+
+// Optimize the weights, according to the provided data
+val weightDS = sgd.optimize(trainingDS)
+{% endhighlight %}
http://gitwipus.apache.org/repos/asf/flink/blob/844c874b/docs/dev/libs/ml/pipelines.md

diff git a/docs/dev/libs/ml/pipelines.md b/docs/dev/libs/ml/pipelines.md
new file mode 100644
index 0000000..e0f7d82
 /dev/null
+++ b/docs/dev/libs/ml/pipelines.md
@@ 0,0 +1,441 @@
+---
+mathjax: include
+title: Looking under the hood of pipelines
+nav-title: Pipelines
+nav-parent_id: ml
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Introduction
+
+The ability to chain together different transformers and predictors is an important feature for
+any Machine Learning (ML) library. In FlinkML we wanted to provide an intuitive API and, at the
+same time, use the capabilities of the Scala language to provide type-safe implementations of our
+pipelines. What we hope to achieve is an easy-to-use API that protects users from type errors at
+pre-flight time (before the job is launched), thereby eliminating cases where long-running jobs
+are submitted to the cluster only to fail due to an error in the series of data transformations
+that commonly happen in an ML pipeline.
+
+In this guide we describe the choices we made during the implementation of chainable
+transformers and predictors in FlinkML, and provide guidelines on how developers can create their
+own algorithms that make use of these capabilities.
+
+## The what and the why
+
+So what do we mean by "ML pipelines"? Pipelines in the ML context can be thought of as chains of
+operations that take some data as input, perform a number of transformations on that data, and
+then output the transformed data, either to be used as the input (features) of a predictor
+function, such as a learning model, or to be used directly in some other task. The end learner
+can of course be part of the pipeline as well.
+ML pipelines can often be complicated sets of operations ([in-depth explanation](http://research.google.com/pubs/pub43146.html)) and
+can become sources of errors for end-to-end learning systems.
+
+The purpose of ML pipelines is then to create a
+framework that can be used to manage the complexity introduced by these chains of operations.
+Pipelines should make it easy for developers to define chained transformations that can be
+applied to the training data in order to create the end features that will be used to train a
+learning model, and then apply the same set of transformations just as easily to unlabeled
+(test) data. Pipelines should also simplify cross-validation and model selection on
+these chains of operations.
+
+Finally, by ensuring that the consecutive links in the pipeline chain "fit together" we also
+avoid costly type errors. Since each step in a pipeline can be a computationally heavy operation,
+we want to avoid running a pipelined job unless we are sure that all the input/output pairs in a
+pipeline "fit".
+
+## Pipelines in FlinkML
+
+The building blocks for pipelines in FlinkML can be found in the `ml.pipeline` package.
+FlinkML follows an API inspired by [sklearn](http://scikit-learn.org), which means that we have
+`Estimator`, `Transformer` and `Predictor` interfaces. For an in-depth look at the design of the
+sklearn API, the interested reader is referred to [this](http://arxiv.org/abs/1309.0238) paper.
+In short, the `Estimator` is the base class from which `Transformer` and `Predictor` inherit.
+`Estimator` defines a `fit` method, and `Transformer` also defines a `transform` method and
+`Predictor` defines a `predict` method.
+
+The `fit` method of the `Estimator` performs the actual training of the model, for example
+finding the correct weights in a linear regression task, or the mean and standard deviation of
+the data in a feature scaler.
+As the names suggest, classes that implement
+`Transformer` are transform operations like [scaling the input](standard_scaler.html) and
+`Predictor` implementations are learning algorithms such as [Multiple Linear Regression]({{site.baseurl}}/dev/libs/ml/multiple_linear_regression.html).
+Pipelines can be created by chaining together a number of Transformers, and the final link in a pipeline can be a Predictor or another Transformer.
+Pipelines that end with a Predictor cannot be chained any further.
+Below is an example of how a pipeline can be formed:
+
+{% highlight scala %}
+// Training data
+val input: DataSet[LabeledVector] = ...
+// Test data
+val unlabeled: DataSet[Vector] = ...
+
+val scaler = StandardScaler()
+val polyFeatures = PolynomialFeatures()
+val mlr = MultipleLinearRegression()
+
+// Construct the pipeline
+val pipeline = scaler
+ .chainTransformer(polyFeatures)
+ .chainPredictor(mlr)
+
+// Train the pipeline (scaler and multiple linear regression)
+pipeline.fit(input)
+
+// Calculate predictions for the testing data
+val predictions: DataSet[LabeledVector] = pipeline.predict(unlabeled)
+
+{% endhighlight %}
+
+As we mentioned, FlinkML pipelines are type-safe.
+If we tried to chain a transformer with output of type `A` to another with input of type `B` we
+would get an error at pre-flight time if `A` != `B`. FlinkML achieves this kind of type-safety
+through the use of Scala's implicits.
+
+### Scala implicits
+
+If you are not familiar with Scala's implicits we can recommend [this excerpt](https://www.artima.com/pins1ed/implicit-conversions-and-parameters.html)
+from Martin Odersky's "Programming in Scala". In short, implicit conversions allow for ad-hoc
+polymorphism in Scala by providing conversions from one type to another, and implicit values
+provide the compiler with default values that can be supplied to function calls through implicit parameters.
+The combination of implicit conversions and implicit parameters is what allows us to chain transform
+and predict operations together in a type-safe manner.
+
+### Operations
+
+As we mentioned, the trait (abstract class) `Estimator` defines a `fit` method. The method has two
+parameter lists
+(i.e. is a [curried function](http://docs.scala-lang.org/tutorials/tour/currying.html)). The
+first parameter list takes the input (training) `DataSet` and the parameters for the estimator.
+The second parameter list takes one `implicit` parameter, of type `FitOperation`. `FitOperation`
+is a class that also defines a `fit` method, and this is where the actual logic of training the
+concrete Estimators should be implemented. The `fit` method of `Estimator` is essentially a
+wrapper around the `fit` method of `FitOperation`. The `predict` method of `Predictor` and the
+`transform` method of `Transformer` are designed in a similar manner, with a respective operation class.
+
+In these methods the operation object is provided as an implicit parameter.
+Scala will [look for implicits](http://docs.scala-lang.org/tutorials/FAQ/finding-implicits.html)
+in the companion object of a type, so classes that implement these interfaces should provide these
+objects as implicit objects inside the companion object.
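+
+The delegation described above can be illustrated without Flink. The following is a minimal
+sketch, not FlinkML code: it uses `Seq` in place of `DataSet`, omits the `ParameterMap`, and
+`MaxScaler` is a hypothetical estimator invented for this example.
+
+{% highlight scala %}
+object ImplicitOpsSketch {
+  // Operation class: holds the training logic for estimator type E on element type T.
+  trait FitOperation[E, T] {
+    def fit(instance: E, input: Seq[T]): Unit
+  }
+
+  // Simplified estimator: its fit method only delegates to the implicit operation.
+  trait Estimator[Self] { that: Self =>
+    def fit[T](input: Seq[T])(implicit op: FitOperation[Self, T]): Unit =
+      op.fit(this, input)
+  }
+
+  // Hypothetical estimator that remembers the largest value it has seen.
+  class MaxScaler extends Estimator[MaxScaler] {
+    var max: Option[Double] = None
+  }
+
+  object MaxScaler {
+    // The compiler finds this value because it lives in the companion object.
+    implicit val fitDoubles: FitOperation[MaxScaler, Double] =
+      new FitOperation[MaxScaler, Double] {
+        override def fit(instance: MaxScaler, input: Seq[Double]): Unit =
+          instance.max = Some(input.max)
+      }
+  }
+}
+{% endhighlight %}
+
+Calling `new ImplicitOpsSketch.MaxScaler().fit(Seq(1.0, 3.0, 2.0))` compiles without passing the
+operation explicitly, because the compiler supplies `fitDoubles` for the second parameter list.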
+
+As an example we can look at the `StandardScaler` class. `StandardScaler` extends `Transformer`, so it has access to its `fit` and `transform` functions.
+These two functions expect objects of `FitOperation` and `TransformOperation` as implicit parameters,
+for the `fit` and `transform` methods respectively, which `StandardScaler` provides in its companion
+object, through `transformVectors` and `fitVectorStandardScaler`:
+
+{% highlight scala %}
+class StandardScaler extends Transformer[StandardScaler] {
+  ...
+}
+
+object StandardScaler {
+
+  ...
+
+  implicit def fitVectorStandardScaler[T <: Vector] = new FitOperation[StandardScaler, T] {
+    override def fit(instance: StandardScaler, fitParameters: ParameterMap, input: DataSet[T])
+      : Unit = {
+      ...
+    }
+  }
+
+  implicit def transformVectors[T <: Vector: VectorConverter: TypeInformation: ClassTag] = {
+    new TransformOperation[StandardScaler, T, T] {
+      override def transform(
+          instance: StandardScaler,
+          transformParameters: ParameterMap,
+          input: DataSet[T])
+        : DataSet[T] = {
+        ...
+      }
+    }
+  }
+
+}
+
+{% endhighlight %}
+
+Note that `StandardScaler` does **not** override the `fit` method of `Estimator` or the `transform`
+method of `Transformer`. Rather, its implementations of `FitOperation` and `TransformOperation`
+override their respective `fit` and `transform` methods, which are then called by the `fit` and
+`transform` methods of `Estimator` and `Transformer`. Similarly, a class that implements
+`Predictor` should define an implicit `PredictOperation` object inside its companion object.
+
+#### Types and type safety
+
+Apart from the `fit` and `transform` operations that we listed above, the `StandardScaler` also
+provides `fit` and `transform` operations for input of type `LabeledVector`.
+This allows us to use the algorithm for input that is labeled or unlabeled, and this happens
+automatically, depending on the type of the input that we give to the fit and transform
+operations. The correct implicit operation is chosen by the compiler, depending on the input type.
+
+If we try to call the `fit` or `transform` methods with types that are not supported we will get a
+runtime error before the job is launched.
+While it would be possible to catch these kinds of errors at compile time as well, the error
+messages that we are able to provide the user would be much less informative, which is why we chose
+to throw runtime exceptions instead.
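+
+One way to obtain this behaviour is a generic fallback implicit that always resolves but throws
+as soon as it is called. The sketch below is an assumption about how such a fallback can be
+written in plain Scala, not a copy of the FlinkML sources: it uses `Seq` instead of `DataSet`,
+and `Counter`, `fallbackFitOperation` and the exact message are hypothetical.
+
+{% highlight scala %}
+import scala.reflect.ClassTag
+
+object FallbackSketch {
+  trait FitOperation[E, T] {
+    def fit(instance: E, input: Seq[T]): Unit
+  }
+
+  trait Estimator[Self] { that: Self =>
+    def fit[T](input: Seq[T])(implicit op: FitOperation[Self, T]): Unit =
+      op.fit(this, input)
+  }
+
+  object Estimator {
+    // Matches every (E, T) pair, but a more specific implicit is preferred over it.
+    implicit def fallbackFitOperation[E: ClassTag, T: ClassTag]: FitOperation[E, T] =
+      new FitOperation[E, T] {
+        override def fit(instance: E, input: Seq[T]): Unit = {
+          val e = implicitly[ClassTag[E]].runtimeClass.getName
+          val t = implicitly[ClassTag[T]].runtimeClass.getName
+          throw new RuntimeException(
+            s"There is no FitOperation defined for $e which trains on a Seq[$t]")
+        }
+      }
+  }
+
+  // Hypothetical estimator that only knows how to train on Int elements.
+  class Counter extends Estimator[Counter] {
+    var count: Int = 0
+  }
+
+  object Counter {
+    implicit val fitInts: FitOperation[Counter, Int] =
+      new FitOperation[Counter, Int] {
+        override def fit(instance: Counter, input: Seq[Int]): Unit =
+          instance.count = input.size
+      }
+  }
+}
+{% endhighlight %}
+
+In this sketch, fitting a `Counter` on a `Seq[Int]` uses `fitInts`, while fitting it on a
+`Seq[String]` still compiles but throws the descriptive exception from the fallback.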
+
+### Chaining
+
+Chaining is achieved by calling `chainTransformer` or `chainPredictor` on an object
+of a class that implements `Transformer`. These methods return a `ChainedTransformer` or
+`ChainedPredictor` object respectively. As we mentioned, `ChainedTransformer` objects can be
+chained further, while `ChainedPredictor` objects cannot. These classes take care of applying
+fit, transform, and predict operations for a pair of successive transformers or
+a transformer and a predictor. They also act recursively if the length of the
+chain is larger than two, since every `ChainedTransformer` defines a `transform` and `fit`
+operation that can be further chained with more transformers or a predictor.
+
+It is important to note that developers and users do not need to worry about chaining when
+implementing their algorithms; all of this is handled automatically by FlinkML.
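+
+The recursive nature of chaining can be sketched in a few lines of plain Scala. This is a
+simplified illustration under stated assumptions, not the FlinkML implementation: it works on
+`Seq` instead of `DataSet`, encodes input and output types as type parameters instead of implicit
+operations, and `Center` and `Scale` are hypothetical transformers.
+
+{% highlight scala %}
+object ChainingSketch {
+  trait Transformer[A, B] { self =>
+    def fit(input: Seq[A]): Unit
+    def transform(input: Seq[A]): Seq[B]
+
+    // The chained result is itself a Transformer, so it can be chained again.
+    def chainTransformer[C](next: Transformer[B, C]): Transformer[A, C] =
+      new Transformer[A, C] {
+        def fit(input: Seq[A]): Unit = {
+          self.fit(input)                  // train the first link
+          next.fit(self.transform(input))  // train the next link on transformed data
+        }
+        def transform(input: Seq[A]): Seq[C] =
+          next.transform(self.transform(input))
+      }
+  }
+
+  // Subtracts the mean learned during fit.
+  class Center extends Transformer[Double, Double] {
+    private var mean = 0.0
+    def fit(input: Seq[Double]): Unit = { mean = input.sum / input.size }
+    def transform(input: Seq[Double]): Seq[Double] = input.map(_ - mean)
+  }
+
+  // Multiplies by a fixed factor; nothing to learn.
+  class Scale(factor: Double) extends Transformer[Double, Double] {
+    def fit(input: Seq[Double]): Unit = ()
+    def transform(input: Seq[Double]): Seq[Double] = input.map(_ * factor)
+  }
+}
+{% endhighlight %}
+
+Because `chainTransformer` returns another `Transformer`, a chain of three or more links is just
+a chain of two whose first link is itself a chain.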
+
+### How to Implement a Pipeline Operator
+
+In order to support FlinkML's pipelining, algorithms have to adhere to a certain design pattern, which we will describe in this section.
+Let's assume that we want to implement a pipeline operator which changes the mean of the data.
+Since centering data is a common preprocessing step in many analysis pipelines, we will implement it as a `Transformer`.
+Therefore, we first create a `MeanTransformer` class which inherits from `Transformer`:
+
+{% highlight scala %}
+class MeanTransformer extends Transformer[MeanTransformer] {}
+{% endhighlight %}
+
+Since we want to be able to configure the mean of the resulting data, we have to add a configuration parameter.
+
+{% highlight scala %}
+class MeanTransformer extends Transformer[MeanTransformer] {
+ def setMean(mean: Double): this.type = {
+ parameters.add(MeanTransformer.Mean, mean)
+ this
+ }
+}
+
+object MeanTransformer {
+ case object Mean extends Parameter[Double] {
+ override val defaultValue: Option[Double] = Some(0.0)
+ }
+
+ def apply(): MeanTransformer = new MeanTransformer
+}
+{% endhighlight %}
+
+Parameters are defined in the companion object of the transformer class and extend the `Parameter` class.
+Since the parameter instances are supposed to act as immutable keys for a parameter map, they should be implemented as `case object`s.
+The default value will be used if no other value has been set by the user of this component.
+If no default value has been specified, meaning that `defaultValue = None`, then the algorithm has to handle this situation accordingly.
+
+We can now instantiate a `MeanTransformer` object and set the mean value of the transformed data.
+But we still have to implement how the transformation works.
+The workflow can be separated into two phases.
+Within the first phase, the transformer learns the mean of the given training data.
+This knowledge can then be used in the second phase to transform the provided data with respect to the configured resulting mean value.
+
+The learning of the mean can be implemented within the `fit` operation of our `Transformer`, which it inherited from `Estimator`.
+Within the `fit` operation, a pipeline component is trained with respect to the given training data.
+The algorithm is, however, **not** implemented by overriding the `fit` method but by providing an implementation of a corresponding `FitOperation` for the correct type.
+Taking a look at the definition of the `fit` method in `Estimator`, which is the parent class of `Transformer`, reveals why this is the case.
+
+{% highlight scala %}
+trait Estimator[Self] extends WithParameters with Serializable {
+ that: Self =>
+
+ def fit[Training](
+ training: DataSet[Training],
+ fitParameters: ParameterMap = ParameterMap.Empty)
+ (implicit fitOperation: FitOperation[Self, Training]): Unit = {
+ FlinkMLTools.registerFlinkMLTypes(training.getExecutionEnvironment)
+ fitOperation.fit(this, fitParameters, training)
+ }
+}
+{% endhighlight %}
+
+We see that the `fit` method is called with an input data set of type `Training`, an optional parameter map and, in the second parameter list, an implicit parameter of type `FitOperation`.
+Within the body of the function, first some machine learning types are registered and then the `fit` method of the `FitOperation` parameter is called.
+The instance passes itself, the parameter map and the training data set as parameters to the method.
+Thus, all the program logic takes place within the `FitOperation`.
+
+The `FitOperation` has two type parameters.
+The first defines the pipeline operator type for which this `FitOperation` shall work and the second type parameter defines the type of the data set elements.
+If we first wanted to implement the `MeanTransformer` to work on `DenseVector`, we would, thus, have to provide an implementation for `FitOperation[MeanTransformer, DenseVector]`.
+
+{% highlight scala %}
+val denseVectorMeanFitOperation = new FitOperation[MeanTransformer, DenseVector] {
+ override def fit(instance: MeanTransformer, fitParameters: ParameterMap, input: DataSet[DenseVector]) : Unit = {
+ import org.apache.flink.ml.math.Breeze._
+ val meanTrainingData: DataSet[DenseVector] = input
+ .map{ x => (x.asBreeze, 1) }
+ .reduce{
+ (left, right) =>
+ (left._1 + right._1, left._2 + right._2)
+ }
+ .map{ p => (p._1/p._2).fromBreeze }
+ }
+}
+{% endhighlight %}
+
+A `FitOperation[T, I]` has a `fit` method which is called with an instance of type `T`, a parameter map and an input `DataSet[I]`.
+In our case `T=MeanTransformer` and `I=DenseVector`.
+The parameter map is necessary if our fit step depends on some parameter values which were not given directly at creation time of the `Transformer`.
+The `FitOperation` of the `MeanTransformer` sums up the `DenseVector` instances of the given input data set and divides the result by the total number of vectors.
+That way, we obtain a `DataSet[DenseVector]` with a single element which is the mean value.
+
+But if we look closely at the implementation, we see that the result of the mean computation is never stored anywhere.
+If we want to use this knowledge in a later step to adjust the mean of some other input, we have to keep it around.
+And here is where the parameter of type `MeanTransformer` which is given to the `fit` method comes into play.
+We can use this instance to store state, which is used by a subsequent `transform` operation which works on the same object.
+But first we have to extend `MeanTransformer` by a member field and then adjust the `FitOperation` implementation.
+
+{% highlight scala %}
+class MeanTransformer extends Transformer[MeanTransformer] {
+  // Mean of the training data, set by the FitOperation
+  var meanOption: Option[DataSet[DenseVector]] = None
+
+  def setMean(mean: Double): this.type = {
+    parameters.add(MeanTransformer.Mean, mean)
+    this
+  }
+}
+
+val denseVectorMeanFitOperation = new FitOperation[MeanTransformer, DenseVector] {
+ override def fit(instance: MeanTransformer, fitParameters: ParameterMap, input: DataSet[DenseVector]) : Unit = {
+ import org.apache.flink.ml.math.Breeze._
+
+ instance.meanOption = Some(input
+ .map{ x => (x.asBreeze, 1) }
+ .reduce{
+ (left, right) =>
+ (left._1 + right._1, left._2 + right._2)
+ }
+ .map{ p => (p._1/p._2).fromBreeze })
+ }
+}
+{% endhighlight %}
+
+If we look at the `transform` method in `Transformer`, we will see that we also need an implementation of `TransformOperation`.
+A possible mean transforming implementation could look like the following.
+
+{% highlight scala %}
+
+val denseVectorMeanTransformOperation = new TransformOperation[MeanTransformer, DenseVector, DenseVector] {
+ override def transform(
+ instance: MeanTransformer,
+ transformParameters: ParameterMap,
+ input: DataSet[DenseVector])
+ : DataSet[DenseVector] = {
+ val resultingParameters = instance.parameters ++ transformParameters
+
+ val resultingMean = resultingParameters(MeanTransformer.Mean)
+
+ instance.meanOption match {
+ case Some(trainingMean) => {
+ input.map{ new MeanTransformMapper(resultingMean) }.withBroadcastSet(trainingMean, "trainingMean")
+ }
+ case None => throw new RuntimeException("MeanTransformer has not been fitted to data.")
+ }
+ }
+}
+
+class MeanTransformMapper(resultingMean: Double) extends RichMapFunction[DenseVector, DenseVector] {
+ var trainingMean: DenseVector = null
+
+ override def open(parameters: Configuration): Unit = {
+ trainingMean = getRuntimeContext().getBroadcastVariable[DenseVector]("trainingMean").get(0)
+ }
+
+ override def map(vector: DenseVector): DenseVector = {
+ import org.apache.flink.ml.math.Breeze._
+
+ val result = vector.asBreeze - trainingMean.asBreeze + resultingMean
+
+ result.fromBreeze
+ }
+}
+{% endhighlight %}
+
+Now we have everything implemented to fit our `MeanTransformer` to a training data set of `DenseVector` instances and to transform them.
+However, when we execute the `fit` operation
+
+{% highlight scala %}
+val trainingData: DataSet[DenseVector] = ...
+val meanTransformer = MeanTransformer()
+
+meanTransformer.fit(trainingData)
+{% endhighlight %}
+
+we receive the following error at runtime: `"There is no FitOperation defined for class MeanTransformer which trains on a DataSet[org.apache.flink.ml.math.DenseVector]"`.
+The reason is that the Scala compiler could not find a fitting `FitOperation` value with the right type parameters for the implicit parameter of the `fit` method.
+Therefore, it chose a fallback implicit value which gives you this error message at runtime.
+In order to make the compiler aware of our implementation, we have to define it as an implicit value and put it in the scope of `MeanTransformer`'s companion object.
+
+{% highlight scala %}
+object MeanTransformer{
+ implicit val denseVectorMeanFitOperation = new FitOperation[MeanTransformer, DenseVector] ...
+
+ implicit val denseVectorMeanTransformOperation = new TransformOperation[MeanTransformer, DenseVector, DenseVector] ...
+}
+{% endhighlight %}
+
+Now we can call `fit` and `transform` of our `MeanTransformer` with `DataSet[DenseVector]` as input.
+Furthermore, we can now use this transformer as part of an analysis pipeline where we have a `DenseVector` as input and expected output.
+
+{% highlight scala %}
+val trainingData: DataSet[DenseVector] = ...
+
+val mean = MeanTransformer().setMean(1.0)
+val polyFeatures = PolynomialFeatures().setDegree(3)
+
+val pipeline = mean.chainTransformer(polyFeatures)
+
+pipeline.fit(trainingData)
+{% endhighlight %}
+
+It is noteworthy that there is no additional code needed to enable chaining.
+The system automatically constructs the pipeline logic using the operations of the individual components.
+
+So far everything works fine with `DenseVector`.
+But what happens if we call our transformer with `LabeledVector` instead?
+
+{% highlight scala %}
+val trainingData: DataSet[LabeledVector] = ...
+
+val mean = MeanTransformer()
+
+mean.fit(trainingData)
+{% endhighlight %}
+
+As before, we see the following exception upon execution of the program: `"There is no FitOperation defined for class MeanTransformer which trains on a DataSet[org.apache.flink.ml.common.LabeledVector]"`.
+Note that this exception is thrown in the pre-flight phase, which means that the job has not been submitted to the runtime system.
+This has the advantage that you won't see a job which runs for a couple of days and then fails because of an incompatible pipeline component.
+Type compatibility is, thus, checked at the very beginning for the complete job.
+
+In order to make the `MeanTransformer` work on `LabeledVector` as well, we have to provide the corresponding operations.
+Consequently, we have to define a `FitOperation[MeanTransformer, LabeledVector]` and `TransformOperation[MeanTransformer, LabeledVector, LabeledVector]` as implicit values in the scope of `MeanTransformer`'s companion object.
+
+{% highlight scala %}
+object MeanTransformer {
+ implicit val labeledVectorFitOperation = new FitOperation[MeanTransformer, LabeledVector] ...
+
+ implicit val labeledVectorTransformOperation = new TransformOperation[MeanTransformer, LabeledVector, LabeledVector] ...
+}
+{% endhighlight %}
+
+If we wanted to implement a `Predictor` instead of a `Transformer`, then we would have to provide a `FitOperation`, too.
+Moreover, a `Predictor` requires a `PredictOperation` which implements how predictions are calculated from testing data.
http://gitwipus.apache.org/repos/asf/flink/blob/844c874b/docs/dev/libs/ml/polynomial_features.md

diff git a/docs/dev/libs/ml/polynomial_features.md b/docs/dev/libs/ml/polynomial_features.md
new file mode 100644
index 0000000..676c132
 /dev/null
+++ b/docs/dev/libs/ml/polynomial_features.md
@@ 0,0 +1,108 @@
+---
+mathjax: include
+title: Polynomial Features
+nav-parent_id: ml
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+
+The polynomial features transformer maps a vector into the polynomial feature space of degree $d$.
+The dimension of the input vector determines the number of polynomial factors whose values are the respective vector entries.
+Given a vector $(x, y, z, \ldots)^T$ the resulting feature vector looks like:
+
+$$\left(x, y, z, x^2, xy, y^2, yz, z^2, x^3, x^2y, x^2z, xy^2, xyz, xz^2, y^3, \ldots\right)^T$$
+
+Flink's implementation orders the polynomials in decreasing order of their degree.
+
+Given the vector $\left(3,2\right)^T$, the polynomial features vector of degree 3 would look like
+
+ $$\left(3^3, 3^2\cdot2, 3\cdot2^2, 2^3, 3^2, 3\cdot2, 2^2, 3, 2\right)^T$$
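+
+The ordering of the example above can be reproduced with a short recursive function. This is a
+plain-Scala sketch for illustration only, not the FlinkML implementation: `PolySketch`,
+`monomials` and `polynomialFeatures` are hypothetical names, and the input is a `Seq[Double]`
+rather than a FlinkML `Vector`.
+
+{% highlight scala %}
+object PolySketch {
+  // All monomials of exactly the given degree, using only entries from index `start` on,
+  // so that each combination of entries appears once.
+  def monomials(v: Seq[Double], degree: Int, start: Int = 0): Seq[Double] =
+    if (degree == 0) Seq(1.0)
+    else (start until v.length).flatMap { i =>
+      monomials(v, degree - 1, i).map(_ * v(i))
+    }
+
+  // Concatenate the monomials from the highest degree down to degree 1.
+  def polynomialFeatures(v: Seq[Double], maxDegree: Int): Seq[Double] =
+    (maxDegree to 1 by -1).flatMap(d => monomials(v, d))
+}
+{% endhighlight %}
+
+For the input $\left(3,2\right)^T$ and degree 3, `PolySketch.polynomialFeatures(Seq(3.0, 2.0), 3)`
+yields the values $27, 18, 12, 8, 9, 6, 4, 3, 2$, matching the vector shown above.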
+
+This transformer can be prepended to all `Transformer` and `Predictor` implementations which expect an input of type `LabeledVector` or any subtype of `Vector`.
+
+## Operations
+
+`PolynomialFeatures` is a `Transformer`.
+As such, it supports the `fit` and `transform` operations.
+
+### Fit
+
+PolynomialFeatures is not trained on data and, thus, supports all types of input data.
+
+### Transform
+
+PolynomialFeatures transforms all subtypes of `Vector` and `LabeledVector` into their respective types:
+
+* `transform[T <: Vector]: DataSet[T] => DataSet[T]`
+* `transform: DataSet[LabeledVector] => DataSet[LabeledVector]`
+
+## Parameters
+
+The polynomial features transformer can be controlled by the following parameters:
+
+<table class="table tablebordered">
+ <thead>
+ <tr>
+ <th class="textleft" style="width: 20%">Parameters</th>
+ <th class="textcenter">Description</th>
+ </tr>
+ </thead>
+
+ <tbody>
+ <tr>
+ <td><strong>Degree</strong></td>
+ <td>
+ <p>
+ The maximum polynomial degree.
+ (Default value: <strong>10</strong>)
+ </p>
+ </td>
+ </tr>
+ </tbody>
+ </table>
+
+## Examples
+
+{% highlight scala %}
+// Obtain the training data set
+val trainingDS: DataSet[LabeledVector] = ...
+
+// Setup polynomial feature transformer of degree 3
+val polyFeatures = PolynomialFeatures()
+.setDegree(3)
+
+// Setup the multiple linear regression learner
+val mlr = MultipleLinearRegression()
+
+// Control the learner via the parameter map
+val parameters = ParameterMap()
+.add(MultipleLinearRegression.Iterations, 20)
+.add(MultipleLinearRegression.Stepsize, 0.5)
+
+// Create pipeline PolynomialFeatures -> MultipleLinearRegression
+val pipeline = polyFeatures.chainPredictor(mlr)
+
+// Train the model, passing the parameter map to the learner
+pipeline.fit(trainingDS, parameters)
+{% endhighlight %}
http://gitwipus.apache.org/repos/asf/flink/blob/844c874b/docs/dev/libs/ml/quickstart.md

diff git a/docs/dev/libs/ml/quickstart.md b/docs/dev/libs/ml/quickstart.md
new file mode 100644
index 0000000..26b9275
 /dev/null
+++ b/docs/dev/libs/ml/quickstart.md
@@ 0,0 +1,243 @@
+---
+mathjax: include
+title: Quickstart Guide
+nav-title: Quickstart
+nav-parent_id: ml
+nav-pos: 0
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Introduction
+
+FlinkML is designed to make learning from your data a straightforward process, abstracting away
+the complexities that usually come with big data learning tasks. In this
+quickstart guide we will show just how easy it is to solve a simple supervised learning problem
+using FlinkML. But first, some basics. Feel free to skip the next few lines if you're already
+familiar with Machine Learning (ML).
+
+As defined by Murphy [[1]](#murphy) ML deals with detecting patterns in data, and using those
+learned patterns to make predictions about the future. We can categorize most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* **Supervised Learning** deals with learning a function (mapping) from a set of inputs
+(features) to a set of outputs. The learning is done using a *training set* of (input,
+output) pairs that we use to approximate the mapping function. Supervised learning problems are
+further divided into classification and regression problems. In classification problems we try to
+predict the *class* that an example belongs to, for example whether a user is going to click on
+an ad or not. Regression problems, on the other hand, are about predicting (real) numerical
+values, often called the dependent variable, for example what the temperature will be tomorrow.
+
+* **Unsupervised Learning** deals with discovering patterns and regularities in the data. An example
+of this would be *clustering*, where we try to discover groupings of the data from the
+descriptive features. Unsupervised learning can also be used for feature selection, for example
+through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Linking with FlinkML
+
+In order to use FlinkML in your project, first you have to
+[set up a Flink program]({{ site.baseurl }}/dev/api_concepts.html#linking-with-flink).
+Next, you have to add the FlinkML dependency to the `pom.xml` of your project:
+
+{% highlight xml %}
+<dependency>
+ <groupId>org.apache.flink</groupId>
+ <artifactId>flink-ml{{ site.scala_version_suffix }}</artifactId>
+ <version>{{site.version }}</version>
+</dependency>
+{% endhighlight %}
+
+## Loading data
+
+To load data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised learning problems it is
+common to use the `LabeledVector` class to represent the `(label, features)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of the example and a `Double`
+member which represents the label, which could be the class in a classification problem, or the dependent
+variable for a regression problem.
+
+As an example, we can use Haberman's Survival Data Set, which you can
+[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data).
+This dataset *"contains cases from a study conducted on the survival of patients who had undergone
+surgery for breast cancer"*. The data comes in a comma-separated file, where the first 3 columns
+are the features and the 4th (last) column is the class, which indicates whether the patient
+survived 5 years or longer (label 1), or died within 5 years (label 2). You can check the [UCI
+page](https://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival) for more information on the data.
+
+We can first load the data as a `DataSet` of `String` tuples, one tuple per line of the file:
+
+{% highlight scala %}
+
+// The wildcard import also provides the implicit type information that readCsvFile needs
+import org.apache.flink.api.scala._
+
+val env = ExecutionEnvironment.getExecutionEnvironment
+
+val survival = env.readCsvFile[(String, String, String, String)]("/path/to/haberman.data")
+
+{% endhighlight %}
+
+We can now transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
+dataset with the FlinkML classification algorithms. We know that the 4th element of the dataset
+is the class label, and the rest are features, so we can build `LabeledVector` elements like this:
+
+{% highlight scala %}
+
+import org.apache.flink.ml.common.LabeledVector
+import org.apache.flink.ml.math.DenseVector
+
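+val survivalLV = survival
+  .map { tuple =>
+    // Convert the four String columns to Doubles
+    val list = tuple.productIterator.toList
+    val numList = list.map(_.asInstanceOf[String].toDouble)
+    // The 4th column is the label, the first three are the features
+    LabeledVector(numList(3), DenseVector(numList.take(3).toArray))
+  }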
+
+{% endhighlight %}
+
+We can then use this data to train a learner. We will however use another dataset to exemplify
+building a learner; that will allow us to show how we can import other dataset formats.
+
+**LibSVM files**
+
+A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
+found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
+datasets using the LibSVM format through the `readLibSVM` function available through the `MLUtils`
+object.
+You can also save datasets in the LibSVM format using the `writeLibSVM` function.
+Let's import the svmguide1 dataset. You can download the
+[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/svmguide1)
+and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/svmguide1.t).
+This is an astroparticle binary classification dataset, used by Hsu et al. [[3]](#hsu) in their
+practical Support Vector Machine (SVM) guide. It contains 4 numerical features, and the class label.
+
+We can then simply import the dataset using:
+
+{% highlight scala %}
+
+import org.apache.flink.ml.MLUtils
+
+// readLibSVM takes the execution environment as its first argument
+val astroTrain: DataSet[LabeledVector] = MLUtils.readLibSVM(env, "/path/to/svmguide1")
+val astroTest: DataSet[LabeledVector] = MLUtils.readLibSVM(env, "/path/to/svmguide1.t")
+
+{% endhighlight %}
+
+This gives us two `DataSet[LabeledVector]` objects that we will use in the following section to
+create a classifier.
+
+## Classification
+
+Once we have imported the dataset we can train a `Predictor` such as a linear SVM classifier.
+We can set a number of parameters for the classifier. Here we set the `Blocks` parameter,
+which the underlying CoCoA algorithm [[2]](#jaggi) uses to split the input. The
+regularization parameter determines the amount of $l_2$ regularization applied, which is used
+to avoid overfitting. The step size determines the contribution of the weight vector updates to
+the next weight vector value. This parameter sets the initial step size.
+
+{% highlight scala %}
+
+import org.apache.flink.ml.classification.SVM
+
+val svm = SVM()
+  .setBlocks(env.getParallelism)
+  .setIterations(100)
+  .setRegularization(0.001)
+  .setStepsize(0.1)
+  .setSeed(42)
+
+svm.fit(astroTrain)
+
+{% endhighlight %}
+
+We can now make predictions on the test set.
+
+{% highlight scala %}
+
+val predictionPairs = svm.predict(astroTest)
+
+{% endhighlight %}
+
+Next we will see how we can preprocess our data, and use the ML pipelines capabilities of FlinkML.
+
+## Data preprocessing and pipelines
+
+A preprocessing step that is often encouraged [[3]](#hsu) when using SVM classification is scaling
+the input features to the [0, 1] range, in order to avoid features with extreme values
+dominating the rest.
+FlinkML has a number of `Transformers` such as `MinMaxScaler` that are used to preprocess data,
+and a key feature is the ability to chain `Transformers` and `Predictors` together. This allows
+us to run the same pipeline of transformations and make predictions on the train and test data in
+a straightforward and type-safe manner. You can read more on the pipeline system of FlinkML
+[in the pipelines documentation](pipelines.html).
+
+Let us first create a normalizing transformer for the features in our dataset, and chain it to a
+new SVM classifier.
+
+{% highlight scala %}
+
+import org.apache.flink.ml.preprocessing.MinMaxScaler
+
+val scaler = MinMaxScaler()
+
+val scaledSVM = scaler.chainPredictor(svm)
+
+{% endhighlight %}
+
+We can now use our newly created pipeline to make predictions on the test set.
+First we call fit again, to train the scaler and the SVM classifier.
+The data of the test set will then be automatically scaled before being passed on to the SVM to
+make predictions.
+
+{% highlight scala %}
+
+scaledSVM.fit(astroTrain)
+
+val predictionPairsScaled: DataSet[(Double, Double)] = scaledSVM.predict(astroTest)
+
+{% endhighlight %}
+
+The scaled inputs should give us better prediction performance.
+The result of the prediction on `LabeledVector`s is a data set of tuples where the first entry denotes the true label value and the second entry is the predicted label value.
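+
+As a minimal sketch of how such pairs could be evaluated, the snippet below computes the
+classification accuracy from `(truth, prediction)` tuples using plain Scala collections. The
+`AccuracySketch` object and the sample pairs are hypothetical and not part of FlinkML; we assume
+the pairs have already been brought to the client, for example with `collect()`.
+
+{% highlight scala %}
+// Hypothetical helper, not part of FlinkML: accuracy from (truth, prediction) pairs.
+object AccuracySketch {
+  def accuracy(pairs: Seq[(Double, Double)]): Double = {
+    require(pairs.nonEmpty, "need at least one (truth, prediction) pair")
+    // Count the pairs where the predicted label equals the true label
+    val correct = pairs.count { case (truth, prediction) => truth == prediction }
+    correct.toDouble / pairs.size
+  }
+
+  def main(args: Array[String]): Unit = {
+    // Made-up pairs for illustration only
+    val pairs = Seq((1.0, 1.0), (-1.0, 1.0), (-1.0, -1.0), (1.0, 1.0))
+    println(accuracy(pairs)) // prints 0.75
+  }
+}
+{% endhighlight %}
+
+Accuracy is only one possible measure; the same tuples could feed any other metric that
+compares true and predicted labels.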
+
+## Where to go from here
+
+This quickstart guide can act as an introduction to the basic concepts of FlinkML, but there's a lot
+more you can do.
+We recommend going through the [FlinkML documentation]({{ site.baseurl }}/dev/libs/ml/index.html), and trying out the different
+algorithms.
+A very good way to get started is to play around with interesting datasets from the UCI ML
+repository and the LibSVM datasets.
+Tackling an interesting problem from a website like [Kaggle](https://www.kaggle.com) or
+[DrivenData](http://www.drivendata.org/) is also a great way to learn by competing with other
+data scientists.
+If you would like to contribute some new algorithms take a look at our
+[contribution guide](contribution_guide.html).
+
+**References**
+
+<a name="murphy"></a>[1] Murphy, Kevin P. *Machine learning: a probabilistic perspective.* MIT
+press, 2012.
+
+<a name="jaggi"></a>[2] Jaggi, Martin, et al. *Communication-efficient distributed dual
+coordinate ascent.* Advances in Neural Information Processing Systems. 2014.
+
+<a name="hsu"></a>[3] Hsu, Chih-Wei, Chih-Chung Chang, and Chih-Jen Lin.
+ *A practical guide to support vector classification.* 2003.
http://git-wip-us.apache.org/repos/asf/flink/blob/844c874b/docs/dev/libs/ml/standard_scaler.md

diff --git a/docs/dev/libs/ml/standard_scaler.md b/docs/dev/libs/ml/standard_scaler.md
new file mode 100644
index 0000000..5104d3c
--- /dev/null
+++ b/docs/dev/libs/ml/standard_scaler.md
@@ -0,0 +1,113 @@
+---
+mathjax: include
+title: Standard Scaler
+nav-parent_id: ml
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+
+ The standard scaler scales the given data set, so that all features will have a user-specified mean and variance.
+ In case the user does not provide a specific mean and standard deviation, the standard scaler transforms the features of the input data set to have mean equal to 0 and standard deviation equal to 1.
+ Given a set of input data $x_1, x_2, \ldots, x_n$, with mean:
+
+ $$\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_{i}$$
+
+ and standard deviation:
+
+ $$\sigma_{x}=\sqrt{ \frac{1}{n} \sum_{i=1}^{n}(x_{i} - \bar{x})^{2}}$$
+
+The scaled data set $z_1, z_2, \ldots, z_n$ will be:
+
+ $$z_{i}= \textit{std} \left(\frac{x_{i} - \bar{x}}{\sigma_{x}}\right) + \textit{mean}$$
+
+where $\textit{std}$ and $\textit{mean}$ are the user-specified values for the standard deviation and mean.
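+
+To make the formula concrete, here is a minimal plain-Scala sketch that applies it to a small
+sequence of numbers. It is only an illustration of the arithmetic above, not the FlinkML
+implementation; the `StandardScaleSketch` object and the sample values are hypothetical, and the
+sketch assumes the input is not constant (so that $\sigma_{x} \neq 0$).
+
+{% highlight scala %}
+// Hypothetical sketch of the scaling formula; not the FlinkML implementation.
+object StandardScaleSketch {
+  // Scales xs to the requested mean and standard deviation, using the
+  // population standard deviation (factor 1/n), as in the formula above.
+  def standardScale(xs: Seq[Double], mean: Double = 0.0, std: Double = 1.0): Seq[Double] = {
+    val n = xs.size.toDouble
+    val xBar = xs.sum / n
+    val sigma = math.sqrt(xs.map(x => (x - xBar) * (x - xBar)).sum / n)
+    xs.map(x => std * ((x - xBar) / sigma) + mean)
+  }
+
+  def main(args: Array[String]): Unit = {
+    // Made-up sample with mean 5.0 and standard deviation 2.0
+    val xs = Seq(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0)
+    // Defaults: mean 0.0, std 1.0
+    println(standardScale(xs)) // prints List(-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0)
+    // User-specified mean 10.0 and std 2.0
+    println(standardScale(xs, 10.0, 2.0)) // prints List(7.0, 9.0, 9.0, 9.0, 10.0, 10.0, 12.0, 14.0)
+  }
+}
+{% endhighlight %}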
+
+## Operations
+
+`StandardScaler` is a `Transformer`.
+As such, it supports the `fit` and `transform` operations.
+
+### Fit
+
+StandardScaler is trained on all subtypes of `Vector` or `LabeledVector`:
+
+* `fit[T <: Vector]: DataSet[T] => Unit`
+* `fit: DataSet[LabeledVector] => Unit`
+
+### Transform
+
+StandardScaler transforms all subtypes of `Vector` or `LabeledVector` into the respective type:
+
+* `transform[T <: Vector]: DataSet[T] => DataSet[T]`
+* `transform: DataSet[LabeledVector] => DataSet[LabeledVector]`
+
+## Parameters
+
+The standard scaler implementation can be controlled by the following two parameters:
+
+ <table class="table table-bordered">
+ <thead>
+ <tr>
+ <th class="text-left" style="width: 20%">Parameters</th>
+ <th class="text-center">Description</th>
+ </tr>
+ </thead>
+
+ <tbody>
+ <tr>
+ <td><strong>Mean</strong></td>
+ <td>
+ <p>
+ The mean of the scaled data set. (Default value: <strong>0.0</strong>)
+ </p>
+ </td>
+ </tr>
+ <tr>
+ <td><strong>Std</strong></td>
+ <td>
+ <p>
+ The standard deviation of the scaled data set. (Default value: <strong>1.0</strong>)
+ </p>
+ </td>
+ </tr>
+ </tbody>
+</table>
+
+## Examples
+
+{% highlight scala %}
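+import org.apache.flink.api.scala._
+import org.apache.flink.ml.math.Vector
+import org.apache.flink.ml.preprocessing.StandardScaler
+
+// Create standard scaler transformer
+val scaler = StandardScaler()
+  .setMean(10.0)
+  .setStd(2.0)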
+
+// Obtain data set to be scaled
+val dataSet: DataSet[Vector] = ...
+
+// Learn the mean and standard deviation of the training data
+scaler.fit(dataSet)
+
+// Scale the provided data set to have mean=10.0 and std=2.0
+val scaledDS = scaler.transform(dataSet)
+{% endhighlight %}
http://git-wip-us.apache.org/repos/asf/flink/blob/844c874b/docs/dev/libs/ml/svm.md

diff --git a/docs/dev/libs/ml/svm.md b/docs/dev/libs/ml/svm.md
new file mode 100644
index 0000000..34fa1ec
--- /dev/null
+++ b/docs/dev/libs/ml/svm.md
@@ -0,0 +1,220 @@
+---
+mathjax: include
+title: SVM using CoCoA
+nav-parent_id: ml
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+
+Implements an SVM with soft-margin using the communication-efficient distributed dual coordinate
+ascent algorithm with hinge-loss function.
+The algorithm solves the following minimization problem:
+
+$$\min_{\mathbf{w} \in \mathbb{R}^d} \frac{\lambda}{2} \left\lVert \mathbf{w} \right\rVert^2 + \frac{1}{n} \sum_{i=1}^n l_{i}\left(\mathbf{w}^T\mathbf{x}_i\right)$$
+
+with $\mathbf{w}$ being the weight vector, $\lambda$ being the regularization constant,
+$$\mathbf{x}_i \in \mathbb{R}^d$$ being the data points and $$l_{i}$$ being the convex loss
+functions, which can also depend on the labels $$y_{i} \in \mathbb{R}$$.
+In the current implementation the regularizer is the $\ell_2$-norm and the loss functions are the hinge-loss functions:
+
+ $$l_{i} = \max\left(0, 1 - y_{i} \mathbf{w}^T\mathbf{x}_i \right)$$
+
+With these choices, the problem definition is equivalent to an SVM with soft-margin.
+Thus, the algorithm allows us to train an SVM with soft-margin.
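+
+The objective above can be evaluated directly for a given weight vector. The following plain-Scala
+sketch does so for a handful of made-up data points; it only illustrates the formula and is not
+how FlinkML computes or optimizes it. The `HingeObjectiveSketch` object, its helper names and the
+sample data are all hypothetical.
+
+{% highlight scala %}
+// Hypothetical sketch: evaluates the regularized hinge-loss objective for a fixed w.
+object HingeObjectiveSketch {
+  type Vec = Array[Double]
+
+  // Inner product of two vectors of equal length
+  def dot(a: Vec, b: Vec): Double = a.zip(b).map { case (x, y) => x * y }.sum
+
+  // Hinge loss of one example: max(0, 1 - y * w^T x)
+  def hingeLoss(w: Vec, x: Vec, y: Double): Double = math.max(0.0, 1.0 - y * dot(w, x))
+
+  // lambda / 2 * ||w||^2 + average hinge loss over the data
+  def objective(w: Vec, data: Seq[(Vec, Double)], lambda: Double): Double = {
+    val regularizer = lambda / 2.0 * dot(w, w)
+    val averageLoss = data.map { case (x, y) => hingeLoss(w, x, y) }.sum / data.size
+    regularizer + averageLoss
+  }
+
+  def main(args: Array[String]): Unit = {
+    val w: Vec = Array(1.0, 0.0)
+    // Made-up (features, label) pairs with labels in {+1, -1}
+    val data: Seq[(Vec, Double)] = Seq(
+      (Array(2.0, 0.0), 1.0),  // margin 2.0  -> loss 0.0
+      (Array(0.5, 1.0), 1.0),  // margin 0.5  -> loss 0.5
+      (Array(1.0, 0.0), -1.0), // margin -1.0 -> loss 2.0
+      (Array(3.0, 1.0), 1.0)   // margin 3.0  -> loss 0.0
+    )
+    println(objective(w, data, lambda = 1.0)) // prints 1.125
+  }
+}
+{% endhighlight %}
+
+Only examples with margin $y_{i} \mathbf{w}^T\mathbf{x}_i < 1$ contribute to the loss term, which
+is what allows some examples to violate the margin in the soft-margin setting.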
+
+The minimization problem is solved by applying stochastic dual coordinate ascent (SDCA).
+In order to make the algorithm efficient in a distributed setting, the CoCoA algorithm calculates
+several iterations of SDCA locally on a data block before merging the local updates into a
+valid global state.
+This state is redistributed to the different data partitions where the next round of local SDCA
+iterations is then executed.
+The number of outer iterations and local SDCA iterations control the overall network costs, because
+network communication is only required once per outer iteration.
+The local SDCA iterations are embarrassingly parallel once the individual data partitions have been
+distributed across the cluster.
+
+The implementation of this algorithm is based on the work of
+[Jaggi et al.](http://arxiv.org/abs/1409.1458)
+
+## Operations
+
+`SVM` is a `Predictor`.
+As such, it supports the `fit` and `predict` operations.
+
+### Fit
+
+SVM is trained given a set of `LabeledVector`:
+
+* `fit: DataSet[LabeledVector] => Unit`
+
+### Predict
+
+SVM predicts for all subtypes of FlinkML's `Vector` the corresponding class label:
+
+* `predict[T <: Vector]: DataSet[T] => DataSet[(T, Double)]`, where the `(T, Double)` tuple
+ corresponds to (original_features, label)
+
+If we call evaluate with a `DataSet[(Vector, Double)]`, we make a prediction on the class label
+for each example, and return a `DataSet[(Double, Double)]`. In each tuple the first element
+is the true value, as was provided from the input `DataSet[(Vector, Double)]` and the second element
+is the predicted value. You can then use these `(truth, prediction)` tuples to evaluate
+the algorithm's performance.
+
+* `evaluate: DataSet[(Vector, Double)] => DataSet[(Double, Double)]`
+
+## Parameters
+
+The SVM implementation can be controlled by the following parameters:
+
+<table class="table table-bordered">
+<thead>
+ <tr>
+ <th class="text-left" style="width: 20%">Parameters</th>
+ <th class="text-center">Description</th>
+ </tr>
+</thead>
+
+<tbody>
+ <tr>
+ <td><strong>Blocks</strong></td>
+ <td>
+ <p>
+ Sets the number of blocks into which the input data will be split.
+ On each block the local stochastic dual coordinate ascent method is executed.
+ This number should be set at least to the degree of parallelism.
+ If no value is specified, then the parallelism of the input DataSet is used as the number of blocks.
+ (Default value: <strong>None</strong>)
+ </p>
+ </td>
+ </tr>
+ <tr>
+ <td><strong>Iterations</strong></td>
+ <td>
+ <p>
+ Defines the maximum number of iterations of the outer loop method.
+ In other words, it defines how often the SDCA method is applied to the blocked data.
+ After each iteration, the locally computed weight vector updates have to be reduced to update the global weight vector value.
+ The new weight vector is broadcast to all SDCA tasks at the beginning of each iteration.
+ (Default value: <strong>10</strong>)
+ </p>
+ </td>
+ </tr>
+ <tr>
+ <td><strong>LocalIterations</strong></td>
+ <td>
+ <p>
+ Defines the maximum number of SDCA iterations.
+ In other words, it defines how many data points are drawn from each local data block to calculate the stochastic dual coordinate ascent.
+ (Default value: <strong>10</strong>)
+ </p>
+ </td>
+ </tr>
+ <tr>
+ <td><strong>Regularization</strong></td>
+ <td>
+ <p>
+ Defines the regularization constant of the SVM algorithm.
+ The higher the value, the smaller the 2-norm of the weight vector will be.
+ In case of an SVM with hinge loss this means that the SVM margin will be wider even though it might contain some false classifications.
+ (Default value: <strong>1.0</strong>)
+ </p>
+ </td>
+ </tr>
+ <tr>
+ <td><strong>Stepsize</strong></td>
+ <td>
+ <p>
+ Defines the initial step size for the updates of the weight vector.
+ The larger the step size is, the larger the contribution of the weight vector updates to the next weight vector value will be.
+ The effective scaling of the updates is $\frac{stepsize}{blocks}$.
+ This value has to be tuned if the algorithm becomes unstable.
+ (Default value: <strong>1.0</strong>)
+ </p>
+ </td>
+ </tr>
+ <tr>
+ <td><strong>ThresholdValue</strong></td>
+ <td>
+ <p>
+ Defines the limiting value for the decision function above which examples are labeled as
+ positive (+1.0). Examples with a decision function value below this value are classified
+ as negative (-1.0). To get the raw decision function values instead, set the
+ OutputDecisionFunction parameter to true. (Default value: <strong>0.0</strong>)
+ </p>
+ </td>
+ </tr>
+ <tr>
+ <td><strong>OutputDecisionFunction</strong></td>
+ <td>
+ <p>
+ Determines whether the predict and evaluate functions of the SVM should return the distance
+ to the separating hyperplane, or binary class labels. Setting this to true will
+ return the raw distance to the hyperplane for each example. Setting it to false will
+ return the binary class label (+1.0, -1.0). (Default value: <strong>false</strong>)
+ </p>
+ </td>
+ </tr>
+ <tr>
+ <td><strong>Seed</strong></td>
+ <td>
+ <p>
+ Defines the seed to initialize the random number generator.
+ The seed directly controls which data points are chosen for the SDCA method.
+ (Default value: <strong>Random Long Integer</strong>)
+ </p>
+ </td>
+</tr>
+</tbody>
+</table>
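+
+The interplay of `ThresholdValue` and `OutputDecisionFunction` described in the table can be
+sketched in plain Scala as follows. This is a hypothetical illustration, not the FlinkML source:
+the `ThresholdSketch` object and its parameter names are made up, the decision function is taken
+to be the raw value $\mathbf{w}^T\mathbf{x}$, and labeling a value exactly equal to the threshold
+as negative is an assumption.
+
+{% highlight scala %}
+// Hypothetical sketch of thresholding a decision value; not the FlinkML source.
+object ThresholdSketch {
+  // Raw decision function value w^T x
+  def decisionValue(w: Array[Double], x: Array[Double]): Double =
+    w.zip(x).map { case (wi, xi) => wi * xi }.sum
+
+  def predict(
+      w: Array[Double],
+      x: Array[Double],
+      thresholdValue: Double = 0.0,
+      outputDecisionFunction: Boolean = false): Double = {
+    val raw = decisionValue(w, x)
+    if (outputDecisionFunction) raw      // return the raw value unchanged
+    else if (raw > thresholdValue) 1.0   // above the threshold: positive
+    else -1.0                            // assumption: ties count as negative
+  }
+
+  def main(args: Array[String]): Unit = {
+    val w = Array(1.0, -2.0)
+    val x = Array(3.0, 0.5) // made-up example with w^T x = 2.0
+    println(predict(w, x))                                // prints 1.0
+    println(predict(w, x, thresholdValue = 2.5))          // prints -1.0
+    println(predict(w, x, outputDecisionFunction = true)) // prints 2.0
+  }
+}
+{% endhighlight %}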
+
+## Examples
+
+{% highlight scala %}
+import org.apache.flink.api.scala._
+import org.apache.flink.ml.math.Vector
+import org.apache.flink.ml.common.LabeledVector
+import org.apache.flink.ml.classification.SVM
+import org.apache.flink.ml.RichExecutionEnvironment
+
+val pathToTrainingFile: String = ???
+val pathToTestingFile: String = ???
+val env = ExecutionEnvironment.getExecutionEnvironment
+
+// Read the training data set, from a LibSVM formatted file
+val trainingDS: DataSet[LabeledVector] = env.readLibSVM(pathToTrainingFile)
+
+// Create the SVM learner
+val svm = SVM()
+  .setBlocks(10)
+
+// Learn the SVM model
+svm.fit(trainingDS)
+
+// Read the testing data set
+val testingDS: DataSet[Vector] = env.readLibSVM(pathToTestingFile).map(_.vector)
+
+// Calculate the predictions for the testing data set
+val predictionDS: DataSet[(Vector, Double)] = svm.predict(testingDS)
+
+{% endhighlight %}
