##### Site index · List index
Message view
Top
From se...@apache.org
Subject [13/89] [abbrv] [partial] flink git commit: [FLINK-4317, FLIP-3] [docs] Restructure docs
Date Thu, 25 Aug 2016 18:48:16 GMT
http://git-wip-us.apache.org/repos/asf/flink/blob/844c874b/docs/dev/libs/ml/index.md
----------------------------------------------------------------------
diff --git a/docs/dev/libs/ml/index.md b/docs/dev/libs/ml/index.md
new file mode 100644
index 0000000..d01e18e
--- /dev/null
+++ b/docs/dev/libs/ml/index.md
@@ -0,0 +1,144 @@
+---
+nav-id: ml
+nav-show_overview: true
+nav-title: Machine Learning
+nav-parent_id: libs
+nav-pos: 4
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+
+Unless required by applicable law or agreed to in writing,
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+-->
+
+FlinkML is the Machine Learning (ML) library for Flink. It is a new effort in the Flink community,
+with a growing list of algorithms and contributors. With FlinkML we aim to provide
+scalable ML algorithms, an intuitive API, and tools that help minimize glue code in end-to-end ML
+systems. You can see more details about our goals and where the library is headed in our [vision
+
+* This will be replaced by the TOC
+{:toc}
+
+## Supported Algorithms
+
+FlinkML currently supports the following algorithms:
+
+### Supervised Learning
+
+* [SVM using Communication efficient distributed dual coordinate ascent (CoCoA)](svm.html)
+* [Multiple linear regression](multiple_linear_regression.html)
+* [Optimization Framework](optimization.html)
+
+### Unsupervised Learning
+
+* [k-Nearest neighbors join](knn.html)
+
+### Data Preprocessing
+
+* [Polynomial Features](polynomial_features.html)
+* [Standard Scaler](standard_scaler.html)
+* [MinMax Scaler](min_max_scaler.html)
+
+### Recommendation
+
+* [Alternating Least Squares (ALS)](als.html)
+
+### Utilities
+
+* [Distance Metrics](distance_metrics.html)
+* [Cross Validation](cross_validation.html)
+
+## Getting Started
+
+You can check out our [quickstart guide](quickstart.html) for a comprehensive getting started
+example.
+
+If you want to jump right in, you have to [set up a Flink program]({{ site.baseurl }}/dev/api_concepts.html#linking-with-flink).
+Next, you have to add the FlinkML dependency to the pom.xml of your project.
+
+{% highlight xml %}
+<dependency>
+  <version>{{site.version }}</version>
+</dependency>
+{% endhighlight %}
+
+Note that FlinkML is currently not part of the binary distribution.
+
+The following code snippet shows how easy it is to train a multiple linear regression model.
+
+{% highlight scala %}
+// LabeledVector is a feature vector with a label (class or real value)
+val trainingData: DataSet[LabeledVector] = ...
+val testingData: DataSet[Vector] = ...
+
+// Alternatively, a Splitter is used to break up a DataSet into training and testing data.
+val dataSet: DataSet[LabeledVector] = ...
+val trainTestData: DataSet[TrainTestDataSet] = Splitter.trainTestSplit(dataSet)
+val trainingData: DataSet[LabeledVector] = trainTestData.training
+val testingData: DataSet[Vector] = trainTestData.testing.map(lv => lv.vector)
+
+val mlr = MultipleLinearRegression()
+  .setStepsize(1.0)
+  .setIterations(100)
+  .setConvergenceThreshold(0.001)
+
+mlr.fit(trainingData)
+
+// The fitted model can now be used to make predictions
+val predictions: DataSet[LabeledVector] = mlr.predict(testingData)
+{% endhighlight %}
+
+## Pipelines
+
+A key concept of FlinkML is its [scikit-learn](http://scikit-learn.org) inspired pipelining mechanism.
+It allows you to quickly build complex data analysis pipelines how they appear in every data scientist's daily work.
+An in-depth description of FlinkML's pipelines and their internal workings can be found [here](pipelines.html).
+
+The following example code shows how easy it is to set up an analysis pipeline with FlinkML.
+
+{% highlight scala %}
+val trainingData: DataSet[LabeledVector] = ...
+val testingData: DataSet[Vector] = ...
+
+val scaler = StandardScaler()
+val polyFeatures = PolynomialFeatures().setDegree(3)
+val mlr = MultipleLinearRegression()
+
+// Construct pipeline of standard scaler, polynomial features and multiple linear regression
+val pipeline = scaler.chainTransformer(polyFeatures).chainPredictor(mlr)
+
+// Train pipeline
+pipeline.fit(trainingData)
+
+// Calculate predictions
+val predictions: DataSet[LabeledVector] = pipeline.predict(testingData)
+{% endhighlight %}
+
+One can chain a Transformer to another Transformer or a set of chained Transformers by calling the method chainTransformer.
+If one wants to chain a Predictor to a Transformer or a set of chained Transformers, one has to call the method chainPredictor.
+
+
+## How to contribute
+
+The Flink community welcomes all contributors who want to get involved in the development of Flink and its libraries.
+[contribution guide]({{site.baseurl}}/dev/libs/ml/contribution_guide.html).

----------------------------------------------------------------------
diff --git a/docs/dev/libs/ml/knn.md b/docs/dev/libs/ml/knn.md
new file mode 100644
index 0000000..0d3ca9a
--- /dev/null
+++ b/docs/dev/libs/ml/knn.md
@@ -0,0 +1,144 @@
+---
+mathjax: include
+title: k-Nearest Neighbors Join
+nav-parent_id: ml
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+
+Unless required by applicable law or agreed to in writing,
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+Implements an exact k-nearest neighbors join algorithm.  Given a training set $A$ and a testing set $B$, the algorithm returns
+
+$$+KNNJ(A, B, k) = \{ \left( b, KNN(b, A, k) \right) \text{ where } b \in B \text{ and } KNN(b, A, k) \text{ are the k-nearest points to }b\text{ in }A \} +$$
+
+The brute-force approach is to compute the distance between every training and testing point. To ease the brute-force computation of computing the distance between every training point a quadtree is used. The quadtree scales well in the number of training points, though poorly in the spatial dimension. The algorithm will automatically choose whether or not to use the quadtree, though the user can override that decision by setting a parameter to force use or not use a quadtree.
+
+## Operations
+
+KNN is a Predictor.
+As such, it supports the fit and predict operation.
+
+### Fit
+
+KNN is trained by a given set of Vector:
+
+* fit[T <: Vector]: DataSet[T] => Unit
+
+### Predict
+
+KNN predicts for all subtypes of FlinkML's Vector the corresponding class label:
+
+* predict[T <: Vector]: DataSet[T] => DataSet[(T, Array[Vector])], where the (T, Array[Vector]) tuple
+  corresponds to (test point, K-nearest training points)
+
+## Parameters
+
+The KNN implementation can be controlled by the following parameters:
+
+   <table class="table table-bordered">
+      <tr>
+        <th class="text-left" style="width: 20%">Parameters</th>
+        <th class="text-center">Description</th>
+      </tr>
+
+    <tbody>
+      <tr>
+        <td><strong>K</strong></td>
+        <td>
+          <p>
+            Defines the number of nearest-neighbors to search for. That is, for each test point, the algorithm finds the K-nearest neighbors in the training set
+            (Default value: <strong>5</strong>)
+          </p>
+        </td>
+      </tr>
+      <tr>
+        <td><strong>DistanceMetric</strong></td>
+        <td>
+          <p>
+            Sets the distance metric we use to calculate the distance between two points. If no metric is specified, then [[org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric]] is used.
+            (Default value: <strong>EuclideanDistanceMetric</strong>)
+          </p>
+        </td>
+      </tr>
+      <tr>
+        <td><strong>Blocks</strong></td>
+        <td>
+          <p>
+            Sets the number of blocks into which the input data will be split. This number should be set
+            at least to the degree of parallelism. If no value is specified, then the parallelism of the
+            input [[DataSet]] is used as the number of blocks.
+            (Default value: <strong>None</strong>)
+          </p>
+        </td>
+      </tr>
+      <tr>
+        <td>
+          <p>
+            A boolean variable that whether or not to use a quadtree to partition the training set to potentially simplify the KNN search. If no value is specified, the code will automatically decide whether or not to use a quadtree. Use of a quadtree scales well with the number of training and testing points, though poorly with the dimension.
+            (Default value: <strong>None</strong>)
+          </p>
+        </td>
+      </tr>
+      <tr>
+        <td><strong>SizeHint</strong></td>
+        <td>
+          <p>Specifies whether the training set or test set is small to optimize the cross product operation needed for the KNN search. If the training set is small this should be CrossHint.FIRST_IS_SMALL and set to CrossHint.SECOND_IS_SMALL if the test set is small.
+             (Default value: <strong>None</strong>)
+          </p>
+        </td>
+      </tr>
+    </tbody>
+  </table>
+
+## Examples
+
+{% highlight scala %}
+
+val env = ExecutionEnvironment.getExecutionEnvironment
+
+// prepare data
+val trainingSet: DataSet[Vector] = ...
+val testingSet: DataSet[Vector] = ...
+
+val knn = KNN()
+  .setK(3)
+  .setBlocks(10)
+  .setDistanceMetric(SquaredEuclideanDistanceMetric())
+  .setSizeHint(CrossHint.SECOND_IS_SMALL)
+
+// run knn join
+knn.fit(trainingSet)
+val result = knn.predict(testingSet).collect()
+{% endhighlight %}
+
+For more details on the computing KNN with and without and quadtree, here is a presentation: [http://danielblazevski.github.io/](http://danielblazevski.github.io/)

----------------------------------------------------------------------
diff --git a/docs/dev/libs/ml/min_max_scaler.md b/docs/dev/libs/ml/min_max_scaler.md
new file mode 100644
index 0000000..35376c3
--- /dev/null
+++ b/docs/dev/libs/ml/min_max_scaler.md
@@ -0,0 +1,112 @@
+---
+mathjax: include
+title: MinMax Scaler
+nav-parent_id: ml
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+
+Unless required by applicable law or agreed to in writing,
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+
+ The MinMax scaler scales the given data set, so that all values will lie between a user specified range [min,max].
+ In case the user does not provide a specific minimum and maximum value for the scaling range, the MinMax scaler transforms the features of the input data set to lie in the [0,1] interval.
+ Given a set of input data $x_1, x_2,... x_n$, with minimum value:
+
+ $$x_{min} = min({x_1, x_2,..., x_n})$$
+
+ and maximum value:
+
+ $$x_{max} = max({x_1, x_2,..., x_n})$$
+
+The scaled data set $z_1, z_2,...,z_n$ will be:
+
+ $$z_{i}= \frac{x_{i} - x_{min}}{x_{max} - x_{min}} \left ( max - min \right ) + min$$
+
+where $\textit{min}$ and $\textit{max}$ are the user specified minimum and maximum values of the range to scale.
+
+## Operations
+
+MinMaxScaler is a Transformer.
+As such, it supports the fit and transform operation.
+
+### Fit
+
+MinMaxScaler is trained on all subtypes of Vector or LabeledVector:
+
+* fit[T <: Vector]: DataSet[T] => Unit
+* fit: DataSet[LabeledVector] => Unit
+
+### Transform
+
+MinMaxScaler transforms all subtypes of Vector or LabeledVector into the respective type:
+
+* transform[T <: Vector]: DataSet[T] => DataSet[T]
+* transform: DataSet[LabeledVector] => DataSet[LabeledVector]
+
+## Parameters
+
+The MinMax scaler implementation can be controlled by the following two parameters:
+
+ <table class="table table-bordered">
+    <tr>
+      <th class="text-left" style="width: 20%">Parameters</th>
+      <th class="text-center">Description</th>
+    </tr>
+
+  <tbody>
+    <tr>
+      <td><strong>Min</strong></td>
+      <td>
+        <p>
+          The minimum value of the range for the scaled data set. (Default value: <strong>0.0</strong>)
+        </p>
+      </td>
+    </tr>
+    <tr>
+      <td><strong>Max</strong></td>
+      <td>
+        <p>
+          The maximum value of the range for the scaled data set. (Default value: <strong>1.0</strong>)
+        </p>
+      </td>
+    </tr>
+  </tbody>
+</table>
+
+## Examples
+
+{% highlight scala %}
+// Create MinMax scaler transformer
+val minMaxscaler = MinMaxScaler()
+  .setMin(-1.0)
+
+// Obtain data set to be scaled
+val dataSet: DataSet[Vector] = ...
+
+// Learn the minimum and maximum values of the training data
+minMaxscaler.fit(dataSet)
+
+// Scale the provided data set to have min=-1.0 and max=1.0
+val scaledDS = minMaxscaler.transform(dataSet)
+{% endhighlight %}

----------------------------------------------------------------------
diff --git a/docs/dev/libs/ml/multiple_linear_regression.md b/docs/dev/libs/ml/multiple_linear_regression.md
new file mode 100644
index 0000000..95ee85f
--- /dev/null
+++ b/docs/dev/libs/ml/multiple_linear_regression.md
@@ -0,0 +1,160 @@
+---
+mathjax: include
+title: Multiple Linear Regression
+nav-parent_id: ml
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+
+Unless required by applicable law or agreed to in writing,
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+
+ Multiple linear regression tries to find a linear function which best fits the provided input data.
+ Given a set of input data with its value $(\mathbf{x}, y)$, multiple linear regression finds
+ a vector $\mathbf{w}$ such that the sum of the squared residuals is minimized:
+
+ $$S(\mathbf{w}) = \sum_{i=1} \left(y - \mathbf{w}^T\mathbf{x_i} \right)^2$$
+
+ Written in matrix notation, we obtain the following formulation:
+
+ $$\mathbf{w}^* = \arg \min_{\mathbf{w}} (\mathbf{y} - X\mathbf{w})^2$$
+
+ This problem has a closed form solution which is given by:
+
+  $$\mathbf{w}^* = \left(X^TX\right)^{-1}X^T\mathbf{y}$$
+
+  However, in cases where the input data set is so huge that a complete parse over the whole data
+  set is prohibitive, one can apply stochastic gradient descent (SGD) to approximate the solution.
+  SGD first calculates for a random subset of the input data set the gradients. The gradient
+  for a given point $\mathbf{x}_i$ is given by:
+
+  $$\nabla_{\mathbf{w}} S(\mathbf{w}, \mathbf{x_i}) = 2\left(\mathbf{w}^T\mathbf{x_i} - + y\right)\mathbf{x_i}$$
+
+  The gradients are averaged and scaled. The scaling is defined by $\gamma = \frac{s}{\sqrt{j}}$
+  with $s$ being the initial step size and $j$ being the current iteration number. The resulting gradient is subtracted from the
+  current weight vector giving the new weight vector for the next iteration:
+
+  $$\mathbf{w}_{t+1} = \mathbf{w}_t - \gamma \frac{1}{n}\sum_{i=1}^n \nabla_{\mathbf{w}} S(\mathbf{w}, \mathbf{x_i})$$
+
+  The multiple linear regression algorithm computes either a fixed number of SGD iterations or terminates based on a dynamic convergence criterion.
+  The convergence criterion is the relative change in the sum of squared residuals:
+
+  $$\frac{S_{k-1} - S_k}{S_{k-1}} < \rho$$
+
+## Operations
+
+MultipleLinearRegression is a Predictor.
+As such, it supports the fit and predict operation.
+
+### Fit
+
+MultipleLinearRegression is trained on a set of LabeledVector:
+
+* fit: DataSet[LabeledVector] => Unit
+
+### Predict
+
+MultipleLinearRegression predicts for all subtypes of Vector the corresponding regression value:
+
+* predict[T <: Vector]: DataSet[T] => DataSet[LabeledVector]
+
+If we call predict with a DataSet[LabeledVector], we make a prediction on the regression value
+for each example, and return a DataSet[(Double, Double)]. In each tuple the first element
+is the true value, as was provided from the input DataSet[LabeledVector] and the second element
+is the predicted value. You can then use these (truth, prediction) tuples to evaluate
+the algorithm's performance.
+
+* predict: DataSet[LabeledVector] => DataSet[(Double, Double)]
+
+## Parameters
+
+  The multiple linear regression implementation can be controlled by the following parameters:
+
+   <table class="table table-bordered">
+      <tr>
+        <th class="text-left" style="width: 20%">Parameters</th>
+        <th class="text-center">Description</th>
+      </tr>
+
+    <tbody>
+      <tr>
+        <td><strong>Iterations</strong></td>
+        <td>
+          <p>
+            The maximum number of iterations. (Default value: <strong>10</strong>)
+          </p>
+        </td>
+      </tr>
+      <tr>
+        <td><strong>Stepsize</strong></td>
+        <td>
+          <p>
+            Initial step size for the gradient descent method.
+            This value controls how far the gradient descent method moves in the opposite direction of the gradient.
+            Tuning this parameter might be crucial to make it stable and to obtain a better performance.
+            (Default value: <strong>0.1</strong>)
+          </p>
+        </td>
+      </tr>
+      <tr>
+        <td><strong>ConvergenceThreshold</strong></td>
+        <td>
+          <p>
+            Threshold for relative change of the sum of squared residuals until the iteration is stopped.
+            (Default value: <strong>None</strong>)
+          </p>
+        </td>
+      </tr>
+      <tr>
+        <td><strong>LearningRateMethod</strong></td>
+        <td>
+            <p>
+                Learning rate method used to calculate the effective learning rate for each iteration.
+                See the list of supported <a href="optimization.html">learning rate methods</a>.
+                (Default value: <strong>LearningRateMethod.Default</strong>)
+            </p>
+        </td>
+      </tr>
+    </tbody>
+  </table>
+
+## Examples
+
+{% highlight scala %}
+// Create multiple linear regression learner
+val mlr = MultipleLinearRegression()
+.setIterations(10)
+.setStepsize(0.5)
+.setConvergenceThreshold(0.001)
+
+// Obtain training and testing data set
+val trainingDS: DataSet[LabeledVector] = ...
+val testingDS: DataSet[Vector] = ...
+
+// Fit the linear model to the provided data
+mlr.fit(trainingDS)
+
+// Calculate the predictions for the test data
+val predictions = mlr.predict(testingDS)
+{% endhighlight %}

----------------------------------------------------------------------
diff --git a/docs/dev/libs/ml/optimization.md b/docs/dev/libs/ml/optimization.md
new file mode 100644
index 0000000..e3e2f63
--- /dev/null
+++ b/docs/dev/libs/ml/optimization.md
@@ -0,0 +1,382 @@
+---
+mathjax: include
+title: Optimization
+nav-parent_id: ml
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+
+Unless required by applicable law or agreed to in writing,
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+-->
+
+{:toc}
+
+## Mathematical Formulation
+
+The optimization framework in FlinkML is a developer-oriented package that can be used to solve
+[optimization](https://en.wikipedia.org/wiki/Mathematical_optimization)
+problems common in Machine Learning (ML) tasks. In the supervised learning context, this usually
+involves finding a model, as defined by a set of parameters $w$, that minimize a function $f(\wv)$
+given a set of $(\x, y)$ examples,
+where $\x$ is a feature vector and $y$ is a real number, which can represent either a real value in
+the regression case, or a class label in the classification case. In supervised learning, the
+function to be minimized is usually of the form:
+
+
+$$\label{eq:objectiveFunc} + f(\wv) := + \frac1n \sum_{i=1}^n L(\wv;\x_i,y_i) + + \lambda\, R(\wv) + \ . +$$
+
+
+where $L$ is the loss function and $R(\wv)$ the regularization penalty. We use $L$ to measure how
+well the model fits the observed data, and we use $R$ in order to impose a complexity cost to the
+model, with $\lambda > 0$ being the regularization parameter.
+
+### Loss Functions
+
+In supervised learning, we use loss functions in order to measure the model fit, by
+penalizing errors in the predictions $p$ made by the model compared to the true $y$ for each
+example. Different loss functions can be used for regression (e.g. Squared Loss) and classification
+
+Some common loss functions are:
+
+* Squared Loss: $\frac{1}{2} \left(\wv^T \cdot \x - y\right)^2, \quad y \in \R$
+* Hinge Loss: $\max \left(0, 1 - y ~ \wv^T \cdot \x\right), \quad y \in \{-1, +1\}$
+* Logistic Loss: $\log\left(1+\exp\left( -y ~ \wv^T \cdot \x\right)\right), \quad y \in \{-1, +1\}$
+
+### Regularization Types
+
+[Regularization](https://en.wikipedia.org/wiki/Regularization_(mathematics)) in machine learning
+imposes penalties to the estimated models, in order to reduce overfitting. The most common penalties
+are the $L_1$ and $L_2$ penalties, defined as:
+
+* $L_1$: $R(\wv) = \norm{\wv}_1$
+* $L_2$: $R(\wv) = \frac{1}{2}\norm{\wv}_2^2$
+
+The $L_2$ penalty penalizes large weights, favoring solutions with more small weights rather than
+few large ones.
+The $L_1$ penalty can be used to drive a number of the solution coefficients to 0, thereby
+producing sparse solutions.
+The regularization constant $\lambda$ in $\eqref{eq:objectiveFunc}$ determines the amount of regularization applied to the model,
+and is usually determined through model cross-validation.
+A good comparison of regularization types can be found in [this](http://www.robotics.stanford.edu/~ang/papers/icml04-l1l2.pdf) paper by Andrew Ng.
+Which regularization type is supported depends on the actually used optimization algorithm.
+
+
+In order to find a (local) minimum of a function, Gradient Descent methods take steps in the
+direction opposite to the gradient of the function $\eqref{eq:objectiveFunc}$ taken with
+respect to the current parameters (weights).
+In order to compute the exact gradient we need to perform one pass through all the points in
+a dataset, making the process computationally expensive.
+An alternative is Stochastic Gradient Descent (SGD) where at each iteration we sample one point
+from the complete dataset and update the parameters for each point, in an online manner.
+
+In mini-batch SGD we instead sample random subsets of the dataset, and compute the gradient
+over each batch. At each iteration of the algorithm we update the weights once, based on
+the average of the gradients computed from each mini-batch.
+
+An important parameter is the learning rate $\eta$, or step size, which can be determined by one of five methods, listed below. The setting of the initial step size can significantly affect the performance of the
+algorithm. For some practical tips on tuning SGD see Leon Botou's
+
+The current implementation of SGD  uses the whole partition, making it
+effectively a batch gradient descent. Once a sampling operator has been introduced in Flink, true
+mini-batch SGD will be performed.
+
+### Regularization
+
+The following list contains a mapping between the implementing classes and the regularization function.
+
+<table class="table table-bordered">
+    <tr>
+      <th class="text-left" style="width: 20%">Class Name</th>
+      <th class="text-center">Regularization function $R(\wv)$</th>
+    </tr>
+  <tbody>
+    <tr>
+      <td>$R(\wv) = 0$</td>
+    </tr>
+    <tr>
+      <td>$R(\wv) = \norm{\wv}_1$</td>
+    </tr>
+    <tr>
+      <td>$R(\wv) = \frac{1}{2}\norm{\wv}_2^2$</td>
+    </tr>
+  </tbody>
+</table>
+
+### Parameters
+
+  The stochastic gradient descent implementation can be controlled by the following parameters:
+
+   <table class="table table-bordered">
+      <tr>
+        <th class="text-left" style="width: 20%">Parameter</th>
+        <th class="text-center">Description</th>
+      </tr>
+    <tbody>
+      <tr>
+        <td><strong>LossFunction</strong></td>
+        <td>
+          <p>
+            The loss function to be optimized. (Default value: <strong>None</strong>)
+          </p>
+        </td>
+      </tr>
+      <tr>
+        <td><strong>RegularizationConstant</strong></td>
+        <td>
+          <p>
+            The amount of regularization to apply. (Default value: <strong>0.1</strong>)
+          </p>
+        </td>
+      </tr>
+      <tr>
+        <td><strong>Iterations</strong></td>
+        <td>
+          <p>
+            The maximum number of iterations. (Default value: <strong>10</strong>)
+          </p>
+        </td>
+      </tr>
+      <tr>
+        <td><strong>LearningRate</strong></td>
+        <td>
+          <p>
+            Initial learning rate for the gradient descent method.
+            This value controls how far the gradient descent method moves in the opposite direction
+            (Default value: <strong>0.1</strong>)
+          </p>
+        </td>
+      </tr>
+      <tr>
+        <td><strong>ConvergenceThreshold</strong></td>
+        <td>
+          <p>
+            When set, iterations stop if the relative change in the value of the objective function $\eqref{eq:objectiveFunc}$ is less than the provided threshold, $\tau$.
+            The convergence criterion is defined as follows: $\left| \frac{f(\wv)_{i-1} - f(\wv)_i}{f(\wv)_{i-1}}\right| < \tau$.
+            (Default value: <strong>None</strong>)
+          </p>
+        </td>
+      </tr>
+      <tr>
+        <td><strong>LearningRateMethod</strong></td>
+        <td>
+          <p>
+            (Default value: <strong>LearningRateMethod.Default</strong>)
+          </p>
+        </td>
+      </tr>
+      <tr>
+        <td><strong>Decay</strong></td>
+        <td>
+          <p>
+            (Default value: <strong>0.0</strong>)
+          </p>
+        </td>
+      </tr>
+    </tbody>
+  </table>
+
+### Loss Function
+
+The loss function which is minimized has to implement the LossFunction interface, which defines methods to compute the loss and the gradient of it.
+Either one defines ones own LossFunction or one uses the GenericLossFunction class which constructs the loss function from an outer loss function and a prediction function.
+An example can be seen here
+
+Scala
+val lossFunction = GenericLossFunction(SquaredLoss, LinearPrediction)
+
+
+The full list of supported outer loss functions can be found [here](#partial-loss-function-values).
+The full list of supported prediction functions can be found [here](#prediction-function-values).
+
+#### Partial Loss Function Values ##
+
+  <table class="table table-bordered">
+      <tr>
+        <th class="text-left" style="width: 20%">Function Name</th>
+        <th class="text-center">Description</th>
+        <th class="text-center">Loss</th>
+        <th class="text-center">Loss Derivative</th>
+      </tr>
+    <tbody>
+      <tr>
+        <td><strong>SquaredLoss</strong></td>
+        <td>
+          <p>
+            Loss function most commonly used for regression tasks.
+          </p>
+        </td>
+        <td class="text-center">$\frac{1}{2} (\wv^T \cdot \x - y)^2$</td>
+        <td class="text-center">$\wv^T \cdot \x - y$</td>
+      </tr>
+    </tbody>
+  </table>
+
+#### Prediction Function Values ##
+
+  <table class="table table-bordered">
+        <tr>
+          <th class="text-left" style="width: 20%">Function Name</th>
+          <th class="text-center">Description</th>
+          <th class="text-center">Prediction</th>
+        </tr>
+      <tbody>
+        <tr>
+          <td><strong>LinearPrediction</strong></td>
+          <td>
+            <p>
+              The function most commonly used for linear models, such as linear regression and
+              linear classifiers.
+            </p>
+          </td>
+          <td class="text-center">$\x^T \cdot \wv$</td>
+          <td class="text-center">$\x$</td>
+        </tr>
+      </tbody>
+    </table>
+
+#### Effective Learning Rate ##
+
+Where:
+
+- $j$ is the iteration number
+
+- $\eta_j$ is the step size on step $j$
+
+- $\eta_0$ is the initial step size
+
+- $\lambda$ is the regularization constant
+
+- $\tau$ is the decay constant, which causes the learning rate to be a decreasing function of $j$, that is to say as iterations increase, learning rate decreases. The exact rate of decay is function specific, see **Inverse Scaling** and **Wei Xu's Method** (which is an extension of the **Inverse Scaling** method).
+
+<table class="table table-bordered">
+      <tr>
+        <th class="text-left" style="width: 20%">Function Name</th>
+        <th class="text-center">Description</th>
+        <th class="text-center">Function</th>
+        <th class="text-center">Called As</th>
+      </tr>
+    <tbody>
+      <tr>
+        <td><strong>Default</strong></td>
+        <td>
+          <p>
+            The function default method used for determining the step size. This is equivalent to the inverse scaling method for $\tau$ = 0.5.  This special case is kept as the default to maintain backwards compatibility.
+          </p>
+        </td>
+        <td class="text-center">$\eta_j = \eta_0/\sqrt{j}$</td>
+        <td class="text-center"><code>LearningRateMethod.Default</code></td>
+      </tr>
+      <tr>
+        <td><strong>Constant</strong></td>
+        <td>
+          <p>
+            The step size is constant throughout the learning task.
+          </p>
+        </td>
+        <td class="text-center">$\eta_j = \eta_0$</td>
+        <td class="text-center"><code>LearningRateMethod.Constant</code></td>
+      </tr>
+      <tr>
+        <td><strong>Leon Bottou's Method</strong></td>
+        <td>
+          <p>
+            This is the <code>'optimal'</code> method of sklearn.
+            The optimal initial value $t_0$ has to be provided.
+            Sklearn uses the following heuristic: $t_0 = \max(1.0, L^\prime(-\beta, 1.0) / (\alpha \cdot \beta)$
+            with $\beta = \sqrt{\frac{1}{\sqrt{\alpha}}}$ and $L^\prime(prediction, truth)$ being the derivative of the loss function.
+          </p>
+        </td>
+        <td class="text-center">$\eta_j = 1 / (\lambda \cdot (t_0 + j -1))$</td>
+        <td class="text-center"><code>LearningRateMethod.Bottou</code></td>
+      </tr>
+      <tr>
+        <td><strong>Inverse Scaling</strong></td>
+        <td>
+          <p>
+            A very common method for determining the step size.
+          </p>
+        </td>
+        <td class="text-center">$\eta_j = \eta_0 / j^{\tau}$</td>
+        <td class="text-center"><code>LearningRateMethod.InvScaling</code></td>
+      </tr>
+      <tr>
+        <td><strong>Wei Xu's Method</strong></td>
+        <td>
+          <p>
+            Method proposed by Wei Xu in <a href="http://arxiv.org/pdf/1107.2490.pdf">Towards Optimal One Pass Large Scale Learning with
+          </p>
+        </td>
+        <td class="text-center">$\eta_j = \eta_0 \cdot (1+ \lambda \cdot \eta_0 \cdot j)^{-\tau}$</td>
+        <td class="text-center"><code>LearningRateMethod.Xu</code></td>
+      </tr>
+    </tbody>
+  </table>
+
+### Examples
+
+In the Flink implementation of SGD, given a set of examples in a DataSet[LabeledVector] and
+optionally some initial weights, we can use GradientDescentL1.optimize() in order to optimize
+the weights for the given data.
+
+The user can provide an initial DataSet[WeightVector],
+which contains one WeightVector element, or use the default weights which are all set to 0.
+A WeightVector is a container class for the weights, which separates the intercept from the
+weight vector. This allows us to avoid applying regularization to the intercept.
+
+
+
+{% highlight scala %}
+// Create stochastic gradient descent solver
+  .setLossFunction(SquaredLoss())
+  .setRegularizationConstant(0.2)
+  .setIterations(100)
+  .setLearningRate(0.01)
+  .setLearningRateMethod(LearningRateMethod.Xu(-0.75))
+
+
+// Obtain data
+val trainingDS: DataSet[LabeledVector] = ...
+
+// Optimize the weights, according to the provided data
+val weightDS = sgd.optimize(trainingDS)
+{% endhighlight %}

----------------------------------------------------------------------
diff --git a/docs/dev/libs/ml/pipelines.md b/docs/dev/libs/ml/pipelines.md
new file mode 100644
index 0000000..e0f7d82
--- /dev/null
+++ b/docs/dev/libs/ml/pipelines.md
@@ -0,0 +1,441 @@
+---
+mathjax: include
+title: Looking under the hood of pipelines
+nav-title: Pipelines
+nav-parent_id: ml
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+
+Unless required by applicable law or agreed to in writing,
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Introduction
+
+The ability to chain together different transformers and predictors is an important feature for
+any Machine Learning (ML) library. In FlinkML we wanted to provide an intuitive API,
+and at the same
+time utilize the capabilities of the Scala language to provide
+type-safe implementations of our pipelines. What we hope to achieve then is an easy to use API,
+that protects users from type errors at pre-flight (before the job is launched) time, thereby
+eliminating cases where long
+running jobs are submitted to the cluster only to see them fail due to some
+error in the series of data transformations that commonly happen in an ML pipeline.
+
+In this guide then we will describe the choices we made during the implementation of chainable
+transformers and predictors in FlinkML, and provide guidelines on how developers can create their
+own algorithms that make use of these capabilities.
+
+## The what and the why
+
+So what do we mean by "ML pipelines"? Pipelines in the ML context can be thought of as chains of
+operations that have some data as input, perform a number of transformations to that data,
+and
+then output the transformed data, either to be used as the input (features) of a predictor
+function, such as a learning model, or just output the transformed data themselves, to be used in
+some other task. The end learner can of course be a part of the pipeline as well.
+ML pipelines can often be complicated sets of operations ([in-depth explanation](http://research.google.com/pubs/pub43146.html)) and
+can become sources of errors for end-to-end learning systems.
+
+The purpose of ML pipelines is then to create a
+framework that can be used to manage the complexity introduced by these chains of operations.
+Pipelines should make it easy for developers to define chained transformations that can be
+applied to the
+training data, in order to create the end features that will be used to train a
+learning model, and then perform the same set of transformations just as easily to unlabeled
+(test) data. Pipelines should also simplify cross-validation and model selection on
+these chains of operations.
+
+Finally, by ensuring that the consecutive links in the pipeline chain "fit together" we also
+avoid costly type errors. Since each step in a pipeline can be a computationally-heavy operation,
+we want to avoid running a pipelined job, unless we are sure that all the input/output pairs in a
+pipeline "fit".
+
+
+The building blocks for pipelines in FlinkML can be found in the ml.pipeline package.
+FlinkML follows an API inspired by [sklearn](http://scikit-learn.org) which means that we have
+Estimator, Transformer and Predictor interfaces. For an in-depth look at the design of the
+sklearn API the interested reader is referred to [this](http://arxiv.org/abs/1309.0238) paper.
+In short, the Estimator is the base class from which Transformer and Predictor inherit.
+Estimator defines a fit method, and Transformer also defines a transform method and
+Predictor defines a predict method.
+
+The fit method of the Estimator performs the actual training of the model, for example
+finding the correct weights in a linear regression task, or the mean and standard deviation of
+the data in a feature scaler.
+As evident by the naming, classes that implement
+Transformer are transform operations like [scaling the input](standard_scaler.html) and
+Predictor implementations are learning algorithms such as [Multiple Linear Regression]({{site.baseurl}}/dev/libs/ml/multiple_linear_regression.html).
+Pipelines can be created by chaining together a number of Transformers, and the final link in a pipeline can be a Predictor or another Transformer.
+Pipelines that end with Predictor cannot be chained any further.
+Below is an example of how a pipeline can be formed:
+
+{% highlight scala %}
+// Training data
+val input: DataSet[LabeledVector] = ...
+// Test data
+val unlabeled: DataSet[Vector] = ...
+
+val scaler = StandardScaler()
+val polyFeatures = PolynomialFeatures()
+val mlr = MultipleLinearRegression()
+
+// Construct the pipeline
+val pipeline = scaler
+  .chainTransformer(polyFeatures)
+  .chainPredictor(mlr)
+
+// Train the pipeline (scaler and multiple linear regression)
+pipeline.fit(input)
+
+// Calculate predictions for the testing data
+val predictions: DataSet[LabeledVector] = pipeline.predict(unlabeled)
+
+{% endhighlight %}
+
+As we mentioned, FlinkML pipelines are type-safe.
+If we tried to chain a transformer with output of type A to another with input of type B we
+would get an error at pre-flight time if A != B. FlinkML achieves this kind of type-safety
+through the use of Scala's implicits.
+
+### Scala implicits
+
+If you are not familiar with Scala's implicits we can recommend [this excerpt](https://www.artima.com/pins1ed/implicit-conversions-and-parameters.html)
+from Martin Odersky's "Programming in Scala". In short, implicit conversions allow for ad-hoc
+polymorphism in Scala by providing conversions from one type to another, and implicit values
+provide the compiler with default values that can be supplied to function calls through implicit parameters.
+The combination of implicit conversions and implicit parameters is what allows us to chain transform
+and predict operations together in a type-safe manner.
+
+### Operations
+
+As we mentioned, the trait (abstract class) Estimator defines a fit method. The method has two
+parameter lists
+(i.e. is a [curried function](http://docs.scala-lang.org/tutorials/tour/currying.html)). The
+first parameter list
+takes the input (training) DataSet and the parameters for the estimator. The second parameter
+list takes one implicit parameter, of type FitOperation. FitOperation is a class that also
+defines a fit method, and this is where the actual logic of training the concrete Estimators
+should be implemented. The fit method of Estimator is essentially a wrapper around the  fit
+method of FitOperation. The predict method of Predictor and the transform method of
+Transform are designed in a similar manner, with a respective operation class.
+
+In these methods the operation object is provided as an implicit parameter.
+Scala will [look for implicits](http://docs.scala-lang.org/tutorials/FAQ/finding-implicits.html)
+in the companion object of a type, so classes that implement these interfaces should provide these
+objects as implicit objects inside the companion object.
+
+As an example we can look at the StandardScaler class. StandardScaler extends Transformer, so it has access to its fit and transform functions.
+These two functions expect objects of FitOperation and TransformOperation as implicit parameters,
+for the fit and transform methods respectively, which StandardScaler provides in its companion
+object, through transformVectors and fitVectorStandardScaler:
+
+{% highlight scala %}
+class StandardScaler extends Transformer[StandardScaler] {
+  ...
+}
+
+object StandardScaler {
+
+  ...
+
+  implicit def fitVectorStandardScaler[T <: Vector] = new FitOperation[StandardScaler, T] {
+    override def fit(instance: StandardScaler, fitParameters: ParameterMap, input: DataSet[T])
+      : Unit = {
+        ...
+      }
+
+  implicit def transformVectors[T <: Vector: VectorConverter: TypeInformation: ClassTag] = {
+      new TransformOperation[StandardScaler, T, T] {
+        override def transform(
+          instance: StandardScaler,
+          transformParameters: ParameterMap,
+          input: DataSet[T])
+        : DataSet[T] = {
+          ...
+        }
+
+}
+
+{% endhighlight %}
+
+Note that StandardScaler does **not** override the fit method of Estimator or the transform
+method of Transformer. Rather, its implementations of FitOperation and TransformOperation
+override their respective fit and transform methods, which are then called by the fit and
+transform methods of Estimator and Transformer.  Similarly, a class that implements
+Predictor should define an implicit PredictOperation object inside its companion object.
+
+#### Types and type safety
+
+Apart from the fit and transform operations that we listed above, the StandardScaler also
+provides fit and transform operations for input of type LabeledVector.
+This allows us to use the  algorithm for input that is labeled or unlabeled, and this happens
+automatically, depending on  the type of the input that we give to the fit and transform
+operations. The correct implicit operation is chosen by the compiler, depending on the input type.
+
+If we try to call the fit or transform methods with types that are not supported we will get a
+runtime error before the job is launched.
+While it would be possible to catch these kinds of errors at compile time as well, the error
+messages that we are able to provide the user would be much less informative, which is why we chose
+
+### Chaining
+
+Chaining is achieved by calling chainTransformer or chainPredictor on an object
+of a class that implements Transformer. These methods return a ChainedTransformer or
+ChainedPredictor object respectively. As we mentioned, ChainedTransformer objects can be
+chained further, while ChainedPredictor objects cannot. These classes take care of applying
+fit, transform, and predict operations for a pair of successive transformers or
+a transformer and a predictor. They also act recursively if the length of the
+chain is larger than two, since every ChainedTransformer defines a transform and fit
+operation that can be further chained with more transformers or a predictor.
+
+It is important to note that developers and users do not need to worry about chaining when
+implementing their algorithms, all this is handled automatically by FlinkML.
+
+### How to Implement a Pipeline Operator
+
+In order to support FlinkML's pipelining, algorithms have to adhere to a certain design pattern, which we will describe in this section.
+Let's assume that we want to implement a pipeline operator which changes the mean of your data.
+Since centering data is a common pre-processing step in many analysis pipelines, we will implement it as a Transformer.
+Therefore, we first create a MeanTransformer class which inherits from Transformer
+
+{% highlight scala %}
+class MeanTransformer extends Transformer[MeanTransformer] {}
+{% endhighlight %}
+
+Since we want to be able to configure the mean of the resulting data, we have to add a configuration parameter.
+
+{% highlight scala %}
+class MeanTransformer extends Transformer[MeanTransformer] {
+  def setMean(mean: Double): this.type = {
+    this
+  }
+}
+
+object MeanTransformer {
+  case object Mean extends Parameter[Double] {
+    override val defaultValue: Option[Double] = Some(0.0)
+  }
+
+  def apply(): MeanTransformer = new MeanTransformer
+}
+{% endhighlight %}
+
+Parameters are defined in the companion object of the transformer class and extend the Parameter class.
+Since the parameter instances are supposed to act as immutable keys for a parameter map, they should be implemented as case objects.
+The default value will be used if no other value has been set by the user of this component.
+If no default value has been specified, meaning that defaultValue = None, then the algorithm has to handle this situation accordingly.
+
+We can now instantiate a MeanTransformer object and set the mean value of the transformed data.
+But we still have to implement how the transformation works.
+The workflow can be separated into two phases.
+Within the first phase, the transformer learns the mean of the given training data.
+This knowledge can then be used in the second phase to transform the provided data with respect to the configured resulting mean value.
+
+The learning of the mean can be implemented within the fit operation of our Transformer, which it inherited from Estimator.
+Within the fit operation, a pipeline component is trained with respect to the given training data.
+The algorithm is, however, **not** implemented by overriding the fit method but by providing an implementation of a corresponding FitOperation for the correct type.
+Taking a look at the definition of the fit method in Estimator, which is the parent class of Transformer, reveals what why this is the case.
+
+{% highlight scala %}
+trait Estimator[Self] extends WithParameters with Serializable {
+  that: Self =>
+
+  def fit[Training](
+      training: DataSet[Training],
+      fitParameters: ParameterMap = ParameterMap.Empty)
+      (implicit fitOperation: FitOperation[Self, Training]): Unit = {
+    fitOperation.fit(this, fitParameters, training)
+  }
+}
+{% endhighlight %}
+
+We see that the fit method is called with an input data set of type Training, an optional parameter list and in the second parameter list with an implicit parameter of type FitOperation.
+Within the body of the function, first some machine learning types are registered and then the fit method of the FitOperation parameter is called.
+The instance gives itself, the parameter map and the training data set as a parameters to the method.
+Thus, all the program logic takes place within the FitOperation.
+
+The FitOperation has two type parameters.
+The first defines the pipeline operator type for which this FitOperation shall work and the second type parameter defines the type of the data set elements.
+If we first wanted to implement the MeanTransformer to work on DenseVector, we would, thus, have to provide an implementation for FitOperation[MeanTransformer, DenseVector].
+
+{% highlight scala %}
+val denseVectorMeanFitOperation = new FitOperation[MeanTransformer, DenseVector] {
+  override def fit(instance: MeanTransformer, fitParameters: ParameterMap, input: DataSet[DenseVector]) : Unit = {
+    val meanTrainingData: DataSet[DenseVector] = input
+      .map{ x => (x.asBreeze, 1) }
+      .reduce{
+        (left, right) =>
+          (left._1 + right._1, left._2 + right._2)
+      }
+      .map{ p => (p._1/p._2).fromBreeze }
+  }
+}
+{% endhighlight %}
+
+A FitOperation[T, I] has a fit method which is called with an instance of type T, a parameter map and an input DataSet[I].
+In our case T=MeanTransformer and I=DenseVector.
+The parameter map is necessary if our fit step depends on some parameter values which were not given directly at creation time of the Transformer.
+The FitOperation of the MeanTransformer sums the DenseVector instances of the given input data set up and divides the result by the total number of vectors.
+That way, we obtain a DataSet[DenseVector] with a single element which is the mean value.
+
+But if we look closely at the implementation, we see that the result of the mean computation is never stored anywhere.
+If we want to use this knowledge in a later step to adjust the mean of some other input, we have to keep it around.
+And here is where the parameter of type MeanTransformer which is given to the fit method comes into play.
+We can use this instance to store state, which is used by a subsequent transform operation which works on the same object.
+But first we have to extend MeanTransformer by a member field and then adjust the FitOperation implementation.
+
+{% highlight scala %}
+class MeanTransformer extends Transformer[Centering] {
+  var meanOption: Option[DataSet[DenseVector]] = None
+
+  def setMean(mean: Double): Mean = {
+  }
+}
+
+val denseVectorMeanFitOperation = new FitOperation[MeanTransformer, DenseVector] {
+  override def fit(instance: MeanTransformer, fitParameters: ParameterMap, input: DataSet[DenseVector]) : Unit = {
+
+    instance.meanOption = Some(input
+      .map{ x => (x.asBreeze, 1) }
+      .reduce{
+        (left, right) =>
+          (left._1 + right._1, left._2 + right._2)
+      }
+      .map{ p => (p._1/p._2).fromBreeze })
+  }
+}
+{% endhighlight %}
+
+If we look at the transform method in Transformer, we will see that we also need an implementation of TransformOperation.
+A possible mean transforming implementation could look like the following.
+
+{% highlight scala %}
+
+val denseVectorMeanTransformOperation = new TransformOperation[MeanTransformer, DenseVector, DenseVector] {
+  override def transform(
+      instance: MeanTransformer,
+      transformParameters: ParameterMap,
+      input: DataSet[DenseVector])
+    : DataSet[DenseVector] = {
+    val resultingParameters = parameters ++ transformParameters
+
+    val resultingMean = resultingParameters(MeanTransformer.Mean)
+
+    instance.meanOption match {
+      case Some(trainingMean) => {
+        input.map{ new MeanTransformMapper(resultingMean) }.withBroadcastSet(trainingMean, "trainingMean")
+      }
+      case None => throw new RuntimeException("MeanTransformer has not been fitted to data.")
+    }
+  }
+}
+
+class MeanTransformMapper(resultingMean: Double) extends RichMapFunction[DenseVector, DenseVector] {
+  var trainingMean: DenseVector = null
+
+  override def open(parameters: Configuration): Unit = {
+  }
+
+  override def map(vector: DenseVector): DenseVector = {
+
+    val result = vector.asBreeze - trainingMean.asBreeze + resultingMean
+
+    result.fromBreeze
+  }
+}
+{% endhighlight %}
+
+Now we have everything implemented to fit our MeanTransformer to a training data set of DenseVector instances and to transform them.
+However, when we execute the fit operation
+
+{% highlight scala %}
+val trainingData: DataSet[DenseVector] = ...
+val meanTransformer = MeanTransformer()
+
+meanTransformer.fit(trainingData)
+{% endhighlight %}
+
+we receive the following error at runtime: "There is no FitOperation defined for class MeanTransformer which trains on a DataSet[org.apache.flink.ml.math.DenseVector]".
+The reason is that the Scala compiler could not find a fitting FitOperation value with the right type parameters for the implicit parameter of the fit method.
+Therefore, it chose a fallback implicit value which gives you this error message at runtime.
+In order to make the compiler aware of our implementation, we have to define it as an implicit value and put it in the scope of the MeanTransformer's companion object.
+
+{% highlight scala %}
+object MeanTransformer{
+  implicit val denseVectorMeanFitOperation = new FitOperation[MeanTransformer, DenseVector] ...
+
+  implicit val denseVectorMeanTransformOperation = new TransformOperation[MeanTransformer, DenseVector, DenseVector] ...
+}
+{% endhighlight %}
+
+Now we can call fit and transform of our MeanTransformer with DataSet[DenseVector] as input.
+Furthermore, we can now use this transformer as part of an analysis pipeline where we have a DenseVector as input and expected output.
+
+{% highlight scala %}
+val trainingData: DataSet[DenseVector] = ...
+
+val mean = MeanTransformer.setMean(1.0)
+val polyFeatures = PolynomialFeatures().setDegree(3)
+
+val pipeline = mean.chainTransformer(polyFeatures)
+
+pipeline.fit(trainingData)
+{% endhighlight %}
+
+It is noteworthy that there is no additional code needed to enable chaining.
+The system automatically constructs the pipeline logic using the operations of the individual components.
+
+So far everything works fine with DenseVector.
+But what happens, if we call our transformer with LabeledVector instead?
+{% highlight scala %}
+val trainingData: DataSet[LabeledVector] = ...
+
+val mean = MeanTransformer()
+
+mean.fit(trainingData)
+{% endhighlight %}
+
+As before we see the following exception upon execution of the program: "There is no FitOperation defined for class MeanTransformer which trains on a DataSet[org.apache.flink.ml.common.LabeledVector]".
+It is noteworthy, that this exception is thrown in the pre-flight phase, which means that the job has not been submitted to the runtime system.
+This has the advantage that you won't see a job which runs for a couple of days and then fails because of an incompatible pipeline component.
+Type compatibility is, thus, checked at the very beginning for the complete job.
+
+In order to make the MeanTransformer work on LabeledVector as well, we have to provide the corresponding operations.
+Consequently, we have to define a FitOperation[MeanTransformer, LabeledVector] and TransformOperation[MeanTransformer, LabeledVector, LabeledVector] as implicit values in the scope of MeanTransformer's companion object.
+
+{% highlight scala %}
+object MeanTransformer {
+  implicit val labeledVectorFitOperation = new FitOperation[MeanTransformer, LabeledVector] ...
+
+  implicit val labeledVectorTransformOperation = new TransformOperation[MeanTransformer, LabeledVector, LabeledVector] ...
+}
+{% endhighlight %}
+
+If we wanted to implement a Predictor instead of a Transformer, then we would have to provide a FitOperation, too.
+Moreover, a Predictor requires a PredictOperation which implements how predictions are calculated from testing data.

----------------------------------------------------------------------
diff --git a/docs/dev/libs/ml/polynomial_features.md b/docs/dev/libs/ml/polynomial_features.md
new file mode 100644
index 0000000..676c132
--- /dev/null
+++ b/docs/dev/libs/ml/polynomial_features.md
@@ -0,0 +1,108 @@
+---
+mathjax: include
+title: Polynomial Features
+nav-parent_id: ml
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+
+Unless required by applicable law or agreed to in writing,
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+
+The polynomial features transformer maps a vector into the polynomial feature space of degree $d$.
+The dimension of the input vector determines the number of polynomial factors whose values are the respective vector entries.
+Given a vector $(x, y, z, \ldots)^T$ the resulting feature vector looks like:
+
+$$\left(x, y, z, x^2, xy, y^2, yz, z^2, x^3, x^2y, x^2z, xy^2, xyz, xz^2, y^3, \ldots\right)^T$$
+
+Flink's implementation orders the polynomials in decreasing order of their degree.
+
+Given the vector $\left(3,2\right)^T$, the polynomial features vector of degree 3 would look like
+
+ $$\left(3^3, 3^2\cdot2, 3\cdot2^2, 2^3, 3^2, 3\cdot2, 2^2, 3, 2\right)^T$$
+
+This transformer can be prepended to all Transformer and Predictor implementations which expect an input of type LabeledVector or any sub-type of Vector.
+
+## Operations
+
+PolynomialFeatures is a Transformer.
+As such, it supports the fit and transform operation.
+
+### Fit
+
+PolynomialFeatures is not trained on data and, thus, supports all types of input data.
+
+### Transform
+
+PolynomialFeatures transforms all subtypes of Vector and LabeledVector into their respective types:
+
+* transform[T <: Vector]: DataSet[T] => DataSet[T]
+* transform: DataSet[LabeledVector] => DataSet[LabeledVector]
+
+## Parameters
+
+The polynomial features transformer can be controlled by the following parameters:
+
+<table class="table table-bordered">
+      <tr>
+        <th class="text-left" style="width: 20%">Parameters</th>
+        <th class="text-center">Description</th>
+      </tr>
+
+    <tbody>
+      <tr>
+        <td><strong>Degree</strong></td>
+        <td>
+          <p>
+            The maximum polynomial degree.
+            (Default value: <strong>10</strong>)
+          </p>
+        </td>
+      </tr>
+    </tbody>
+  </table>
+
+## Examples
+
+{% highlight scala %}
+// Obtain the training data set
+val trainingDS: DataSet[LabeledVector] = ...
+
+// Setup polynomial feature transformer of degree 3
+val polyFeatures = PolynomialFeatures()
+.setDegree(3)
+
+// Setup the multiple linear regression learner
+val mlr = MultipleLinearRegression()
+
+// Control the learner via the parameter map
+val parameters = ParameterMap()
+
+// Create pipeline PolynomialFeatures -> MultipleLinearRegression
+val pipeline = polyFeatures.chainPredictor(mlr)
+
+// train the model
+pipeline.fit(trainingDS)
+{% endhighlight %}

----------------------------------------------------------------------
diff --git a/docs/dev/libs/ml/quickstart.md b/docs/dev/libs/ml/quickstart.md
new file mode 100644
index 0000000..26b9275
--- /dev/null
+++ b/docs/dev/libs/ml/quickstart.md
@@ -0,0 +1,243 @@
+---
+mathjax: include
+title: Quickstart Guide
+nav-title: Quickstart
+nav-parent_id: ml
+nav-pos: 0
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+
+Unless required by applicable law or agreed to in writing,
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward process, abstracting away
+the complexities that usually come with big data learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines if you're already
+familiar with Machine Learning (ML).
+
+As defined by Murphy [[1]](#murphy) ML deals with detecting patterns in data, and using those
+learned patterns to make predictions about the future. We can categorize most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* **Supervised Learning** deals with learning a function (mapping) from a set of inputs
+(features) to a set of outputs. The learning is done using a *training set* of (input,
+output) pairs that we use to approximate the mapping function. Supervised learning problems are
+further divided into classification and regression problems. In classification problems we try to
+predict the *class* that an example belongs to, for example whether a user is going to click on
+an ad or not. Regression problems one the other hand, are about predicting (real) numerical
+values, often called the dependent variable, for example what the temperature will be tomorrow.
+
+* **Unsupervised Learning** deals with discovering patterns and regularities in the data. An example
+of this would be *clustering*, where we try to discover groupings of the data from the
+descriptive features. Unsupervised learning can also be used for feature selection, for example
+through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+
+In order to use FlinkML in your project, first you have to
+Next, you have to add the FlinkML dependency to the pom.xml of your project:
+
+{% highlight xml %}
+<dependency>
+  <version>{{site.version }}</version>
+</dependency>
+{% endhighlight %}
+
+
+To load data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised learning problems it is
+common to use the LabeledVector class to represent the (label, features) examples. A LabeledVector
+object will have a FlinkML Vector member representing the features of the example and a Double
+member which represents the label, which could be the class in a classification problem, or the dependent
+variable for a regression problem.
+
+As an example, we can use Haberman's Survival Data Set , which you can
+This dataset *"contains cases from a study conducted on the survival of patients who had undergone
+surgery for breast cancer"*. The data comes in a comma-separated file, where the first 3 columns
+are the features and last column is the class, and the 4th column indicates whether the patient
+survived 5 years or longer (label 1), or died within 5 years (label 2). You can check the [UCI
+
+We can load the data as a DataSet[String] first:
+
+{% highlight scala %}
+
+
+val env = ExecutionEnvironment.getExecutionEnvironment
+
+val survival = env.readCsvFile[(String, String, String, String)]("/path/to/haberman.data")
+
+{% endhighlight %}
+
+We can now transform the data into a DataSet[LabeledVector]. This will allow us to use the
+dataset with the FlinkML classification algorithms. We know that the 4th element of the dataset
+is the class label, and the rest are features, so we can build LabeledVector elements like this:
+
+{% highlight scala %}
+
+
+val survivalLV = survival
+  .map{tuple =>
+    val list = tuple.productIterator.toList
+    val numList = list.map(_.asInstanceOf[String].toDouble)
+    LabeledVector(numList(3), DenseVector(numList.take(3).toArray))
+  }
+
+{% endhighlight %}
+
+We can then use this data to train a learner. We will however use another dataset to exemplify
+building a learner; that will allow us to show how we can import other dataset formats.
+
+**LibSVM files**
+
+A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
+datasets using the LibSVM format through the readLibSVM function available through the MLUtils
+object.
+You can also save datasets in the LibSVM format using the writeLibSVM function.
+[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/svmguide1)
+and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/svmguide1.t).
+This is an astroparticle binary classification dataset, used by Hsu et al. [[3]](#hsu) in their
+practical Support Vector Machine (SVM) guide. It contains 4 numerical features, and the class label.
+
+We can simply import the dataset then using:
+
+{% highlight scala %}
+
+
+
+{% endhighlight %}
+
+This gives us two DataSet[LabeledVector] objects that we will use in the following section to
+create a classifier.
+
+## Classification
+
+Once we have imported the dataset we can train a Predictor such as a linear SVM classifier.
+We can set a number of parameters for the classifier. Here we set the Blocks parameter,
+which is used to split the input by the underlying CoCoA algorithm [[2]](#jaggi) uses. The
+regularization parameter determines the amount of $l_2$ regularization applied, which is used
+to avoid overfitting. The step size determines the contribution of the weight vector updates to
+the next weight vector value. This parameter sets the initial step size.
+
+{% highlight scala %}
+
+
+val svm = SVM()
+  .setBlocks(env.getParallelism)
+  .setIterations(100)
+  .setRegularization(0.001)
+  .setStepsize(0.1)
+  .setSeed(42)
+
+svm.fit(astroTrain)
+
+{% endhighlight %}
+
+We can now make predictions on the test set.
+
+{% highlight scala %}
+
+val predictionPairs = svm.predict(astroTest)
+
+{% endhighlight %}
+
+Next we will see how we can pre-process our data, and use the ML pipelines capabilities of FlinkML.
+
+## Data pre-processing and pipelines
+
+A pre-processing step that is often encouraged [[3]](#hsu) when using SVM classification is scaling
+the input features to the [0, 1] range, in order to avoid features with extreme values
+dominating the rest.
+FlinkML has a number of Transformers such as MinMaxScaler that are used to pre-process data,
+and a key feature is the ability to chain Transformers and Predictors together. This allows
+us to run the same pipeline of transformations and make predictions on the train and test data in
+a straight-forward and type-safe manner. You can read more on the pipeline system of FlinkML
+[in the pipelines documentation](pipelines.html).
+
+Let us first create a normalizing transformer for the features in our dataset, and chain it to a
+new SVM classifier.
+
+{% highlight scala %}
+
+
+val scaler = MinMaxScaler()
+
+val scaledSVM = scaler.chainPredictor(svm)
+
+{% endhighlight %}
+
+We can now use our newly created pipeline to make predictions on the test set.
+First we call fit again, to train the scaler and the SVM classifier.
+The data of the test set will then be automatically scaled before being passed on to the SVM to
+make predictions.
+
+{% highlight scala %}
+
+scaledSVM.fit(astroTrain)
+
+val predictionPairsScaled: DataSet[(Double, Double)] = scaledSVM.predict(astroTest)
+
+{% endhighlight %}
+
+The scaled inputs should give us better prediction performance.
+The result of the prediction on LabeledVectors is a data set of tuples where the first entry denotes the true label value and the second entry is the predicted label value.
+
+## Where to go from here
+
+This quickstart guide can act as an introduction to the basic concepts of FlinkML, but there's a lot
+more you can do.
+We recommend going through the [FlinkML documentation]({{ site.baseurl }}/dev/libs/ml/index.html), and trying out the different
+algorithms.
+A very good way to get started is to play around with interesting datasets from the UCI ML
+repository and the LibSVM datasets.
+Tackling an interesting problem from a website like [Kaggle](https://www.kaggle.com) or
+[DrivenData](http://www.drivendata.org/) is also a great way to learn by competing with other
+data scientists.
+If you would like to contribute some new algorithms take a look at our
+[contribution guide](contribution_guide.html).
+
+**References**
+
+<a name="murphy"></a>[1] Murphy, Kevin P. *Machine learning: a probabilistic perspective.* MIT
+press, 2012.
+
+<a name="jaggi"></a>[2] Jaggi, Martin, et al. *Communication-efficient distributed dual
+coordinate ascent.* Advances in Neural Information Processing Systems. 2014.
+
+<a name="hsu"></a>[3] Hsu, Chih-Wei, Chih-Chung Chang, and Chih-Jen Lin.
+ *A practical guide to support vector classification.* 2003.

----------------------------------------------------------------------
diff --git a/docs/dev/libs/ml/standard_scaler.md b/docs/dev/libs/ml/standard_scaler.md
new file mode 100644
index 0000000..5104d3c
--- /dev/null
+++ b/docs/dev/libs/ml/standard_scaler.md
@@ -0,0 +1,113 @@
+---
+mathjax: include
+title: Standard Scaler
+nav-parent_id: ml
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+
+Unless required by applicable law or agreed to in writing,
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+
+ The standard scaler scales the given data set, so that all features will have a user specified mean and variance.
+ In case the user does not provide a specific mean and standard deviation, the standard scaler transforms the features of the input data set to have mean equal to 0 and standard deviation equal to 1.
+ Given a set of input data $x_1, x_2,... x_n$, with mean:
+
+ $$\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_{i}$$
+
+ and standard deviation:
+
+ $$\sigma_{x}=\sqrt{ \frac{1}{n} \sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}$$
+
+The scaled data set $z_1, z_2,...,z_n$ will be:
+
+ $$z_{i}= std \left (\frac{x_{i} - \bar{x} }{\sigma_{x}}\right ) + mean$$
+
+where $\textit{std}$ and $\textit{mean}$ are the user specified values for the standard deviation and mean.
+
+## Operations
+
+StandardScaler is a Transformer.
+As such, it supports the fit and transform operation.
+
+### Fit
+
+StandardScaler is trained on all subtypes of Vector or LabeledVector:
+
+* fit[T <: Vector]: DataSet[T] => Unit
+* fit: DataSet[LabeledVector] => Unit
+
+### Transform
+
+StandardScaler transforms all subtypes of Vector or LabeledVector into the respective type:
+
+* transform[T <: Vector]: DataSet[T] => DataSet[T]
+* transform: DataSet[LabeledVector] => DataSet[LabeledVector]
+
+## Parameters
+
+The standard scaler implementation can be controlled by the following two parameters:
+
+ <table class="table table-bordered">
+    <tr>
+      <th class="text-left" style="width: 20%">Parameters</th>
+      <th class="text-center">Description</th>
+    </tr>
+
+  <tbody>
+    <tr>
+      <td><strong>Mean</strong></td>
+      <td>
+        <p>
+          The mean of the scaled data set. (Default value: <strong>0.0</strong>)
+        </p>
+      </td>
+    </tr>
+    <tr>
+      <td><strong>Std</strong></td>
+      <td>
+        <p>
+          The standard deviation of the scaled data set. (Default value: <strong>1.0</strong>)
+        </p>
+      </td>
+    </tr>
+  </tbody>
+</table>
+
+## Examples
+
+{% highlight scala %}
+// Create standard scaler transformer
+val scaler = StandardScaler()
+.setMean(10.0)
+.setStd(2.0)
+
+// Obtain data set to be scaled
+val dataSet: DataSet[Vector] = ...
+
+// Learn the mean and standard deviation of the training data
+scaler.fit(dataSet)
+
+// Scale the provided data set to have mean=10.0 and std=2.0
+val scaledDS = scaler.transform(dataSet)
+{% endhighlight %}

----------------------------------------------------------------------
diff --git a/docs/dev/libs/ml/svm.md b/docs/dev/libs/ml/svm.md
new file mode 100644
index 0000000..34fa1ec
--- /dev/null
+++ b/docs/dev/libs/ml/svm.md
@@ -0,0 +1,220 @@
+---
+mathjax: include
+title: SVM using CoCoA
+nav-parent_id: ml
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+
+Unless required by applicable law or agreed to in writing,
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+
+Implements an SVM with soft-margin using the communication-efficient distributed dual coordinate
+ascent algorithm with hinge-loss function.
+The algorithm solves the following minimization problem:
+
+$$\min_{\mathbf{w} \in \mathbb{R}^d} \frac{\lambda}{2} \left\lVert \mathbf{w} \right\rVert^2 + \frac{1}{n} \sum_{i=1}^n l_{i}\left(\mathbf{w}^T\mathbf{x}_i\right)$$
+
+with $\mathbf{w}$ being the weight vector, $\lambda$ being the regularization constant,
+$$\mathbf{x}_i \in \mathbb{R}^d$$ being the data points and $$l_{i}$$ being the convex loss
+functions, which can also depend on the labels $$y_{i} \in \mathbb{R}$$.
+In the current implementation the regularizer is the $\ell_2$-norm and the loss functions are the hinge-loss functions:
+
+  $$l_{i} = \max\left(0, 1 - y_{i} \mathbf{w}^T\mathbf{x}_i \right)$$
+
+With these choices, the problem definition is equivalent to a SVM with soft-margin.
+Thus, the algorithm allows us to train a SVM with soft-margin.
+
+The minimization problem is solved by applying stochastic dual coordinate ascent (SDCA).
+In order to make the algorithm efficient in a distributed setting, the CoCoA algorithm calculates
+several iterations of SDCA locally on a data block before merging the local updates into a
+valid global state.
+This state is redistributed to the different data partitions where the next round of local SDCA
+iterations is then executed.
+The number of outer iterations and local SDCA iterations control the overall network costs, because
+there is only network communication required for each outer iteration.
+The local SDCA iterations are embarrassingly parallel once the individual data partitions have been
+distributed across the cluster.
+
+The implementation of this algorithm is based on the work of
+[Jaggi et al.](http://arxiv.org/abs/1409.1458)
+
+## Operations
+
+SVM is a Predictor.
+As such, it supports the fit and predict operation.
+
+### Fit
+
+SVM is trained given a set of LabeledVector:
+
+* fit: DataSet[LabeledVector] => Unit
+
+### Predict
+
+SVM predicts for all subtypes of FlinkML's Vector the corresponding class label:
+
+* predict[T <: Vector]: DataSet[T] => DataSet[(T, Double)], where the (T, Double) tuple
+  corresponds to (original_features, label)
+
+If we call evaluate with a DataSet[(Vector, Double)], we make a prediction on the class label
+for each example, and return a DataSet[(Double, Double)]. In each tuple the first element
+is the true value, as was provided from the input DataSet[(Vector, Double)] and the second element
+is the predicted value. You can then use these (truth, prediction) tuples to evaluate
+the algorithm's performance.
+
+* predict: DataSet[(Vector, Double)] => DataSet[(Double, Double)]
+
+## Parameters
+
+The SVM implementation can be controlled by the following parameters:
+
+<table class="table table-bordered">
+  <tr>
+    <th class="text-left" style="width: 20%">Parameters</th>
+    <th class="text-center">Description</th>
+  </tr>
+
+<tbody>
+  <tr>
+    <td><strong>Blocks</strong></td>
+    <td>
+      <p>
+        Sets the number of blocks into which the input data will be split.
+        On each block the local stochastic dual coordinate ascent method is executed.
+        This number should be set at least to the degree of parallelism.
+        If no value is specified, then the parallelism of the input DataSet is used as the number of blocks.
+        (Default value: <strong>None</strong>)
+      </p>
+    </td>
+  </tr>
+  <tr>
+    <td><strong>Iterations</strong></td>
+    <td>
+      <p>
+        Defines the maximum number of iterations of the outer loop method.
+        In other words, it defines how often the SDCA method is applied to the blocked data.
+        After each iteration, the locally computed weight vector updates have to be reduced to update the global weight vector value.
+        The new weight vector is broadcast to all SDCA tasks at the beginning of each iteration.
+        (Default value: <strong>10</strong>)
+      </p>
+    </td>
+  </tr>
+  <tr>
+    <td><strong>LocalIterations</strong></td>
+    <td>
+      <p>
+        Defines the maximum number of SDCA iterations.
+        In other words, it defines how many data points are drawn from each local data block to calculate the stochastic dual coordinate ascent.
+        (Default value: <strong>10</strong>)
+      </p>
+    </td>
+  </tr>
+  <tr>
+    <td><strong>Regularization</strong></td>
+    <td>
+      <p>
+        Defines the regularization constant of the SVM algorithm.
+        The higher the value, the smaller will the 2-norm of the weight vector be.
+        In case of a SVM with hinge loss this means that the SVM margin will be wider even though it might contain some false classifications.
+        (Default value: <strong>1.0</strong>)
+      </p>
+    </td>
+  </tr>
+  <tr>
+    <td><strong>Stepsize</strong></td>
+    <td>
+      <p>
+        Defines the initial step size for the updates of the weight vector.
+        The larger the step size is, the larger will be the contribution of the weight vector updates to the next weight vector value.
+        The effective scaling of the updates is $\frac{stepsize}{blocks}$.
+        This value has to be tuned in case that the algorithm becomes unstable.
+        (Default value: <strong>1.0</strong>)
+      </p>
+    </td>
+  </tr>
+  <tr>
+    <td><strong>ThresholdValue</strong></td>
+    <td>
+      <p>
+        Defines the limiting value for the decision function above which examples are labeled as
+        positive (+1.0). Examples with a decision function value below this value are classified
+        as negative (-1.0). In order to get the raw decision function values you need to indicate it by
+        using the OutputDecisionFunction parameter.  (Default value: <strong>0.0</strong>)
+      </p>
+    </td>
+  </tr>
+  <tr>
+    <td><strong>OutputDecisionFunction</strong></td>
+    <td>
+      <p>
+        Determines whether the predict and evaluate functions of the SVM should return the distance
+        to the separating hyperplane, or binary class labels. Setting this to true will
+        return the raw distance to the hyperplane for each example. Setting it to false will
+        return the binary class label (+1.0, -1.0) (Default value: <strong>false</strong>)
+      </p>
+    </td>
+  </tr>
+  <tr>
+  <td><strong>Seed</strong></td>
+  <td>
+    <p>
+      Defines the seed to initialize the random number generator.
+      The seed directly controls which data points are chosen for the SDCA method.
+      (Default value: <strong>Random Long Integer</strong>)
+    </p>
+  </td>
+</tr>
+</tbody>
+</table>
+
+## Examples
+
+{% highlight scala %}
+
+val pathToTrainingFile: String = ???
+val pathToTestingFile: String = ???
+val env = ExecutionEnvironment.getExecutionEnvironment
+
+// Read the training data set, from a LibSVM formatted file
+
+// Create the SVM learner
+val svm = SVM()
+  .setBlocks(10)
+
+// Learn the SVM model
+svm.fit(trainingDS)
+
+// Read the testing data set